1 Research Summary

1.1  Project  goals 

The  main  goals  for  our  project  have  been: 

1.  to  develop  automated  information  organization  algorithms 

2.  to  integrate  the  information  organization  algorithms  in  a  mobile  agent  platform 

We  have  fulfilled  both  goals  successfully.  The  following  sections  detail  our  results. 

1.2  Background 

Information  in  electronic  form  is  proliferating  rapidly  in  a  variety  of  forms.  We  now  have  powerful 
search  engines  that  can  return  pointers  to  millions  of  documents  on  any  subject.  How  can  we 
tap  into  this  fortune,  while  avoiding  information  overload? 

The productivity and success of an individual in a society inundated with electronic data
will  be  largely  determined  by  timely  access  to  information.  This  is  particularly  challenging  when 
the  data  is  unstructured,  active,  and  heterogeneous.  It  seems  unlikely  that  we  could  package 
information  in  a  standardized  form  for  the  purposes  of  extraction  and  interpretation,  because 
people’s  information  needs  are  varied  and  the  production  of  information  is  easy.  The  production 
of information is such that “manufacturing” facilities can be moved easily at little or no cost,
giving rise to transient data sources. Just as the invention of the railroad (and other means
of mass transportation) has made it possible for consumers to obtain products quickly, our
vision is to provide ubiquitous, customized, and organized access to information for all users. To this end, we
advocate  technologies  for  systems  in  which  customers  can  express  information  needs  in  flexible 
ways,  and  that  provide  facilities  for  an  intelligent  and  customized  exploration  of  the  Web  and 
other  information  spaces. 

To  build  useful  tools  for  tapping  into  the  vast  evolving  net  of  information  resources,  we  need 
to  address  two  fundamental  issues. 

a. Data access in modern computing environments: how do we access information in a
computing environment defined by dynamic wireless computing platforms and transient databases?

Computers  are  departing  from  their  traditional  desk-top  configurations  and  becoming  more 
portable.  We  now  have  wireless  computers  and  palm-top  computers  that  can  interface  with 
the rest of the electronic world independent of their physical location. In an environment with
ubiquitous  computers,  we  would  like  to  provide  ubiquitous  access  to  computers  and  information. 
Sample  applications  include  anywhere,  anytime  communication,  flexible  scheduling,  smart  rooms, 
embedded devices, support for collaborative decision making, etc. The key research questions
that need to be addressed to support such applications include: (1) what hardware infrastructure
is best suited for this task? (2) how can we provide active networking in a dynamic system
of computers? and (3) how can we locate users and forward information when the originating
host might become disconnected?

b.  Data  evolution:  how  do  we  organize  an  ever-changing  information  space? 

As information systems grow, they need to autonomously reorganize themselves to effectively
meet requests for information. Such reorganization could range from simple processes, like active
selection and caching of information views that facilitate query processing, to more complex
processes that generate and maintain active indexes into information. Sample applications include
automatic capture and access tasks in digital libraries. Information organization algorithms
can be used to organize any collection of documents in a customized way. Users can organize
their email files. Large corporations can organize their manuals, news releases, and internal
documents. Such a system can be used to create “Web Centers” and “Yellow Pages” automatically
and provide users with better interfaces. The system can also be used as a front end to a
search engine. The key to supporting such applications is research that leads to efficient,
theoretically well-grounded algorithms for information organization and to flexible,
modular, and customizable systems that use these algorithms.

To address these issues effectively, we need new ways of conceptualizing and communicating
information needs. We believe that a good approach is a computational paradigm relying
on customizable mobile information agents. By agents we mean autonomous decision-making
programs that migrate from host to host under their own control. By customizable we mean programs
that evolve automatically with changes in the information landscape, as well as programs
that can be easily modified and assembled by users, according to their tasks. By information
agents we mean dedicated programs that can run sophisticated information capture, access, and
organization algorithms.

Our main objective has been to investigate and demonstrate the value of a paradigm of computation
in heterogeneous distributed systems with non-permanent network connections, in which
mobile agents bring the computation to the data. A mobile agent is an autonomous program that
can  migrate  from  machine  to  machine  in  a  heterogeneous  network,  at  times  and  to  places  of 
its  own  choosing.  This  navigation  autonomy  is  very  powerful,  and  requires  an  agent  to  have 
substantial  intelligence  in  making  decisions  and  filtering  information. 

We  have  built  a  system  called  D’Agents  that  supports  mobile  agents.  D’Agents  is  especially 
suited  to  distributed  information  access  experiments  in  a  network  of  mobile  computers,  such 
as  laptops,  palmtops,  and  other  wireless  devices.  Mobile  computers  do  not  have  a  permanent 
connection  into  the  network  and  are  often  disconnected  for  a  long  period  of  time.  We  focus  on 
applications  that  require  extensive  data  processing  over  distributed  and  transient  databases  in 
wireless  networks.  We  look  for  algorithms  that  allow  agents  to  retrieve  and  organize  the  relevant 
data as naturally occurring hierarchies of topics and subtopics.

Mobile agents provide a convenient, efficient, and intelligent paradigm for implementing distributed
applications, especially in the context of wireless computing. First, by migrating to the
location  of  an  electronic  resource,  an  agent  can  access  the  resource  locally  and  eliminate  costly 
data  transfers  over  congested  networks.  This  reduces  network  traffic,  because  it  is  often  cheaper 
to  send  a  small  agent  to  a  data  source  than  to  send  all  the  intermediate  data  to  the  requesting 
site. Second, the agent does not require a permanent connection to the host machine (e.g., the
computer from which the agent is launched). This capability supports distributed information-processing
applications on mobile computers. Third, the network-sensing capabilities enable
agents  to  autonomously  find  the  host  computer,  even  when  the  host  changes  its  geographical 
location.  Fourth,  the  network  software-  and  hardware-sensing  capabilities  permit  transportable 
agents to navigate adaptively. Fifth, our transportable agents can communicate with each other
even when they do not know their specific locations in the network. Finally, agents have autonomy
in decision making: by using feedback from visiting a site, they can independently modify the
overall  plan  or  refine  ill-specified  queries.  When  combined  with  communication,  decision-making 
enables  our  agents  to  be  negotiators.  D’Agents  supports  negotiation  through  an  infrastructure 
that  supports  transactions  on  electronic  cash,  arbitration  on  electronic  cash  transactions,  and 
economic  policies  for  resource  control. 

Although  this  contract  has  not  supported  the  entire  development  of  the  D’Agents  system,  it 
has supported a small portion of it. The rest of the project was supported by DARPA, AFOSR,
and ONR. We used the D’Agents system as a platform in evaluating some of the applications of
our  information  organization  algorithms  described  in  the  next  section. 

1.3  Information  Organization 

For this aspect of our work, we are motivated by a long-term vision in which information systems
can  help  leaders  make  decisions  by  collecting,  filtering,  updating,  and  presenting  information 
quickly, accurately, and effectively. Information systems will compute the underlying topic-subtopic
structure of dynamic textual databases. As new information comes into the database,
the  system  will  fuse  it  with  the  existing  topic  structure.  The  system  will  also  be  able  to  remove 
documents  from  the  database. 

Our work focuses on a paradigm for organizing data that can be used as a pre-processing step
in  a  static  information  system  or  as  a  post-processing  step  on  the  specific  documents  retrieved  by 
a  query.  As  a  pre-processor,  this  system  assists  users  with  deciding  how  to  browse  the  corpus  by 
highlighting  relevant  topics  and  irrelevant  subtopics.  Such  clustered  data  is  useful  for  narrowing 
down  the  corpus  over  which  detailed  queries  can  be  formulated.  As  a  post-processor,  this  system 
classifies  the  retrieved  data  into  clusters  that  capture  topic  categories  and  subcategories. 

We  have  developed,  implemented,  and  evaluated  an  information  organization  algorithm  called 
the star algorithm. The star algorithm gives a hierarchical organization of a collection into clusters. Each
level  in  the  hierarchy  is  determined  by  a  threshold  for  the  minimum  similarity  between  pairs  of 
documents  within  a  cluster  at  that  particular  level  in  the  hierarchy.  This  method  conveys  the 
topic-subtopic  structure  of  the  corpus  according  to  the  similarity  measure  used. 

The  problem  can  be  formulated  by  representing  an  information  system  by  its  similarity  graph. 
A  similarity  graph  is  an  undirected,  weighted  graph  G  =  (V,E,w)  where  vertices  in  the  graph 
correspond  to  documents  and  each  weighted  edge  in  the  graph  corresponds  to  a  measure  of 
similarity  between  two  documents.  We  measure  the  similarity  between  two  documents  by  using 
the  cosine  metric  in  the  vector  space  model  of  the  Smart  information  retrieval  system.  G  is 
a  complete  graph  with  edges  of  varying  weight.  An  organization  of  the  graph  that  produces 
reliable clusters of similarity sigma (i.e., clusters where documents pairwise have similarities of
at least sigma) can be obtained by performing a minimum clique cover of all edges whose weights
are above the threshold sigma. Unfortunately, this approach is computationally intractable. For
real corpora, these graphs can be very large. The clique cover problem is NP-complete, and it
does not admit polynomial-time approximation algorithms. While we can neither compute a clique
cover nor even approximate one, we can instead cover our graph by dense subgraphs.
Specifically, we use star-shaped subgraphs. A star-shaped subgraph on m+1 vertices consists of
a  single  star  center  and  m  satellite  vertices,  where  there  exist  edges  between  the  star  center  and 
each  of  the  satellite  vertices.  The  star-cover  algorithm  is  provably  accurate  in  that  it  produces 
dense  clusters  with  provable  guarantees  on  the  pairwise  similarity  between  cluster  documents, 
and it can be quickly computed. The documents in each cluster are tightly inter-related and a
minimum similarity between all the document pairs in the cluster is guaranteed. This
resulting  structure  reflects  the  underlying  topic  structure  of  the  data.  A  topic  summary  for  each 
cluster  is  provided  by  the  center  of  the  underlying  star  for  the  cluster. 
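
The star cover described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this report: documents are reduced to raw term counts rather than the weighted vectors of the Smart system, star centers are chosen greedily in order of decreasing degree in the thresholded graph, and the names cosine, thresholded_graph, and star_cover are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def thresholded_graph(vectors, sigma):
    """Adjacency sets that keep only edges with similarity >= sigma."""
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= sigma:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def star_cover(adj):
    """Greedy star cover (assumed rule): returns {center: satellites}."""
    covered = set()
    stars = {}
    # Visit vertices from highest to lowest degree; ties broken by index.
    for v in sorted(adj, key=lambda u: (-len(adj[u]), u)):
        if v in covered:
            continue  # already a center or a satellite of some star
        stars[v] = set(adj[v])  # every neighbor becomes a satellite
        covered.add(v)
        covered |= adj[v]
    return stars


# Toy corpus, invented for illustration only.
docs = [
    "mobile agents migrate between hosts",
    "mobile agents migrate to data hosts",
    "star clustering of documents",
    "clustering documents with star covers",
]
vectors = [Counter(d.split()) for d in docs]
stars = star_cover(thresholded_graph(vectors, sigma=0.4))
print(stars)
```

In this toy run the two agent documents form one star and the two clustering documents form another; the center of each star plays the role of the topic summary for its cluster.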

This  approach  has  three  nice  features.  First,  by  using  star-shaped  graphs  to  cover  the  similarity 
graph,  we  are  guaranteed  that  all  the  documents  in  a  cluster  have  the  desired  degree  of  similarity. 
Second,  covering  the  edges  of  the  graph  allows  vertices  to  belong  to  several  clusters.  Documents 
can  be  members  of  multiple  clusters,  which  is  a  desirable  feature  when  documents  have  multiple 
subthemes.  Third,  this  algorithm  can  be  iterated  for  a  range  of  thresholds,  effectively  producing 
a  hierarchical  organization  structure  for  the  information  system.  Each  level  in  the  hierarchy 
summarizes  the  collection  at  a  granularity  provided  by  the  threshold. 
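
The second and third features can be illustrated with a small sketch. It assumes a hypothetical five-document similarity matrix and the same greedy, degree-ordered choice of star centers as an approximation of the star algorithm; the function name star_cover and the threshold values are illustrative, not taken from the prototype.

```python
def star_cover(sim, sigma):
    """Greedy star cover of the graph whose edges have similarity >= sigma."""
    n = len(sim)
    adj = [{j for j in range(n) if j != i and sim[i][j] >= sigma}
           for i in range(n)]
    covered, stars = set(), {}
    for v in sorted(range(n), key=lambda u: (-len(adj[u]), u)):
        if v not in covered:
            stars[v] = adj[v]          # satellites of the new star center
            covered |= adj[v] | {v}
    return stars


# Hypothetical symmetric similarity matrix for five documents.
sim = [
    [1.0, 0.8, 0.5, 0.1, 0.1],
    [0.8, 1.0, 0.5, 0.1, 0.1],
    [0.5, 0.5, 1.0, 0.5, 0.1],
    [0.1, 0.1, 0.5, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.7, 1.0],
]

# One level of the hierarchy per threshold: a lower threshold gives
# coarser clusters, a higher threshold gives finer ones.
hierarchy = {sigma: star_cover(sim, sigma) for sigma in (0.4, 0.6)}
for sigma, stars in hierarchy.items():
    print(sigma, stars)
```

At threshold 0.4, document 3 is a satellite of two different stars, which shows how a document with several subthemes can belong to more than one cluster. At threshold 0.6 the collection splits into smaller clusters, giving a finer level of the hierarchy.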

We have developed a prototype system for doing this task. We have experimented with this
system and found that its precision-recall performance is higher than that of other techniques
such as the single-link method, the average-link method, and the k-means method. We are currently
working on an on-line version of this system that could organize dynamically changing document
collections. 

These algorithms can be used to create knowledge bases automatically. A set of raw documents
is indexed to create an information base. By clustering an information base and summarizing the
results of the organized collection we add a higher level of knowledge into the database. This
type of knowledge can be used to reduce information overload and has applications in a variety
of tasks, such as customized filtering of information, topic detection and tracking in continuous
information streams, collaborative decision making, etc.

The  off-line  and  on-line  star  algorithms  can  be  optimized  further  for  better  performance.  Note 
that  both  versions  of  the  algorithm  rely  on  the  existence  of  the  similarity  matrix.  Similarity 
matrices can get very large: for a document set with n documents the similarity matrix is an O(n^2)
space data structure. Moreover, computing the matrix takes O(n^2) time, which is much more
expensive than the basic cost of the star clustering algorithm, which is approximately O(V + E)
time. Thus, it is clear that the similarity matrix is a bottleneck. Computing this matrix is a one-time
pre-processing operation. However, the data structure has to be available on a permanent
basis.  For  these  reasons,  it  is  important  to  consider  methods  that  improve  on  the  similarity 
matrix  bottleneck. 
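
To give a feel for the quadratic growth, the following back-of-the-envelope calculation counts the distinct document pairs and the storage they would need. The 4 bytes per similarity value and the function name similarity_matrix_cost are assumptions made for illustration; the collection size of 130,000 documents is borrowed from the corpus mentioned later in this report.

```python
def similarity_matrix_cost(n, bytes_per_entry=4):
    """Number of distinct document pairs and the bytes needed to store them."""
    entries = n * (n - 1) // 2  # upper triangle of the symmetric matrix
    return entries, entries * bytes_per_entry


entries, size_bytes = similarity_matrix_cost(130_000)
print(f"{entries:,} pairs, about {size_bytes / 1e9:.1f} GB")
```

Even when only the upper triangle of the symmetric matrix is stored, a collection of this size needs tens of gigabytes under these assumptions, which is why approximating or avoiding the matrix is attractive.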

We have developed, implemented, and started testing an extension that approximates the star
algorithm  by  using  sampling  to  compute  the  similarity  matrix.  The  basic  idea  is  to  create  a 
sample  of  the  document  collection  that  is  much  smaller  than  the  actual  collection.  This  sample 
can  then  be  used  to  compute  a  complete  Star  Clustering,  using  the  off-line  star  algorithm  and 
the remaining documents can be inserted in the resulting structure. An additional optimization
is to eliminate the similarity matrix entirely. The key information used by the star algorithm is
the  degree  of  the  nodes  in  the  thresholded  similarity  graph.  This  information  can  be  represented 
in  an  array  and  it  can  be  computed  approximately,  by  sampling. 
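
One way such a degree array could be estimated by sampling is sketched below. The report does not specify the sampling scheme, so everything here is an assumption: each document is compared only against a uniform random sample of the collection, and the count of similar sample documents is scaled up to the full collection size. The names estimate_degrees and same_parity are hypothetical.

```python
import random


def estimate_degrees(n, similarity, sigma, sample_size, seed=0):
    """Estimate each node's degree in the thresholded similarity graph.

    similarity(i, j) returns the similarity of documents i and j.
    Only n * sample_size similarities are computed instead of n * n.
    """
    rng = random.Random(seed)
    sample = rng.sample(range(n), sample_size)
    degrees = []
    for i in range(n):
        others = [j for j in sample if j != i]
        hits = sum(1 for j in others if similarity(i, j) >= sigma)
        # Scale the sampled count up to the n - 1 possible neighbors.
        degrees.append(hits * (n - 1) / len(others) if others else 0.0)
    return degrees


# Toy similarity: documents are similar exactly when their indices
# have the same parity (invented for illustration).
def same_parity(i, j):
    return 1.0 if i % 2 == j % 2 else 0.0


approx = estimate_degrees(1000, same_parity, sigma=0.5, sample_size=50)
exact = estimate_degrees(10, same_parity, sigma=0.5, sample_size=10)
```

When the sample is the whole collection the estimate is exact; with a smaller sample the array is only approximate, which trades accuracy for avoiding the full similarity matrix.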

Another bottleneck for the star algorithm comes up in Internet applications, such as organizing
data collected from various sites and databases by topic. Consider a task in which several
databases  are  queried  with  the  same  question.  The  documents  returned  by  these  queries  are  to 
be  fused  and  presented  to  the  user  in  a  coherent  picture.  One  approach  is  to  run  the  queries, 
download all documents, and organize the entire collection at the user site using the star algorithm.
An alternative approach is to run the queries, organize the search results at the location
of  the  database,  and  then  merge  these  results  on  the  user  machine.  This  second  alternative  has 
several advantages: (1) the star algorithm can be run in parallel, which provides a speedup; (2)
the document transfer operation can also be parallelized (note that if the number of documents
is large and the network bandwidth is low, the cost of the transfer can be overwhelming); and (3)
the local topic organizations can be viewed as a way of compressing the documents, can be used
to generate the merged topics in the distributed collection, and can be transferred much faster
than the actual documents to the user’s machine.

For these reasons, we developed a third approximation of the star algorithm called the distributed
star algorithm, which is useful especially when the document collection is very large. The
distributed star algorithm provides parallelism and is based on a “divide and conquer” approach.
The document collection is partitioned into several disjoint sets. The sets are clustered separately
and the resulting clusters are then merged. We are currently implementing this distributed
version of the information organization system with D’Agents. We plan to use this integrated
version  of  the  system  as  an  application  on  top  of  Serval,  a  large  distributed  database  of 

In  another  project,  we  started  to  investigate  a  new  information-theoretic  model  for  document 
retrieval  and  clustering.  In  this  model,  a  collection  of  text  documents  (the  “corpus”)  is  first 
analyzed  to  determine  a  probability  model  for  the  terms  within  the  corpus.  Terms  that  appear 
infrequently  have  relatively  low  assigned  probabilities,  while  terms  that  appear  frequently  have 
relatively  high  assigned  probabilities.  The  Shannon  information  is  then  computed  for  each  of  the 
terms  in  the  corpus — it  is  simply  the  length  of  the  codeword  (in  bits)  assigned  to  each  term  in  the 
optimal  encoding  scheme  for  compressing  or  transmitting  the  corpus.  The  Shannon  information 
can  be  efficiently  computed  from  the  corpus  probability  model. 
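
As a small illustration of this step, the sketch below assigns each term a Shannon information of -log2 p(term). It assumes the simplest possible probability model, in which p(term) is the term's relative frequency in the corpus; the report does not state which model was actually used, and the function name term_information and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_information(corpus):
    """Shannon information, in bits, of every term in the corpus."""
    counts = Counter(term for doc in corpus for term in doc.split())
    total = sum(counts.values())
    # A term with probability p carries -log2(p) bits.
    return {term: -math.log2(c / total) for term, c in counts.items()}


# Toy corpus of eight term occurrences, invented for illustration.
corpus = ["agent star agent", "agent cluster star", "agent graph"]
info = term_information(corpus)
print(info)
```

Here "agent" accounts for half of all term occurrences and so carries 1 bit, while the rarer terms "star" and "cluster" carry 2 and 3 bits respectively.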

Given  a  query  in  the  form  of  a  collection  of  keywords,  we  can  perform  document  retrieval  by 
determining  the  total  number  of  bits  that  each  document  contains  about  the  given  keywords  and 
returning  relevant  documents  ranked  according  to  this  measure.  For  each  document,  this  bit  total 
can  be  calculated  by  summing,  for  each  keyword,  the  product  of  its  Shannon  information  times 
the  frequency  with  which  that  keyword  appears  in  the  document  (normalized  in  such  a  way  that 
“short”  documents  and  “long”  documents  are  treated  equally).  Clustering  can  be  achieved  via 
an  information-theoretic  similarity  measure  which  can  be  derived  within  this  model.  Essentially, 
the  pairwise-similarity  between  two  documents  corresponds  to  the  fraction  of  keyword  bits  that 
the  two  documents  share  in  common. 
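
The retrieval step can be sketched as follows. The sketch assumes a relative-frequency probability model and normalizes keyword frequency by document length, which is only one possible way to treat short and long documents equally; the function name rank and the toy corpus are hypothetical, and the similarity measure used for clustering is not reproduced here because its exact form is not given above.

```python
import math
from collections import Counter


def rank(corpus, keywords):
    """Rank documents by the bits they contain about the keywords."""
    docs = [doc.split() for doc in corpus]
    counts = Counter(term for d in docs for term in d)
    total = sum(counts.values())
    info = {t: -math.log2(c / total) for t, c in counts.items()}

    scores = []
    for idx, d in enumerate(docs):
        tf = Counter(d)
        # Shannon information times length-normalized keyword frequency.
        bits = sum(info.get(k, 0.0) * tf[k] / max(len(d), 1)
                   for k in keywords)
        scores.append((bits, idx))
    # Highest score first; ties broken by document index.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


corpus = ["agent star agent", "agent cluster star", "agent graph"]
ranking = rank(corpus, ["cluster", "star"])
print([idx for _, idx in ranking])
```

The second document ranks first because it contains both keywords, including the rarer and therefore more informative one.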

We  have  implemented  a  system  employing  these  ideas  on  a  large  corpus  containing  some 
130,000  documents.  So  far  our  results  are  encouraging,  both  in  terms  of  accuracy  (the  “quality” 
of  retrieved  documents)  and  efficiency  (query  retrieval  on  100,000+  documents  in  a  fraction  of 
a second on a PC).

2  Lessons  Learned 

This  effort  has  uncovered  some  valuable  lessons  for  the  computer  science  community,  for  the 
Air Force, and for the population at large.

1.  Information  overload  is  a  serious  problem  and  efficient  automatic  information  organization 
algorithms  are  useful  in  addressing  this  problem. 

2. In our experiments, the Star clustering algorithm was the best-performing algorithm for
large-scale information organization among those we compared.

3. The Star clustering algorithm can be used in an on-line or off-line fashion and has several
scalable  extensions. 

4.  The  Star  clustering  algorithm  has  been  analyzed  and  our  large-scale  experiments  match 
the  theory. 

5.  The  Star  clustering  algorithm  can  be  used  for  filtering  applications  and  for  persistent 
queries. 

6. By combining the Star clustering algorithm with the power of a mobile agent system we increase
system performance dramatically. Specifically, we conserve bandwidth by transferring
the code to the data, performing data processing at the site of the data, and bringing back
only the relevant results. In addition, mobile agents support multiple queries without connecting
to the home machine and thus contribute to the reduction of the total completion
time of a job. Finally, mobile agents support disconnected queries in low-latency wireless
networks.

3 Students

The  following  students  were  supported  on  this  contract: 

•  Ekaterina  Pelekhov,  PhD  student,  thesis  defended  in  May  2000;  expecting  the  final  version. 

•  Mark  Montague,  PhD  student,  thesis  expected  in  May  2001. 

•  Ken  Yasuhara,  undergraduate  student,  currently  a  PhD  student  in  the  computer  science 
department  at  the  University  of  Washington. 

In addition to these students, Profs. Jay Aslam, David Kotz, and Daniela Rus
were also supported in part by this contract.

4  Software 

We designed and implemented a mobile-agent system called D’Agents
(see http://www.cs.dartmouth.edu/~agent/agenttcl.html). We have completed several releases
of this system, which has security mechanisms for protecting machines from malicious agents and
several additional capabilities for agents over the previous release. These versions support Agent
Tcl, Agent Java, and Agent Scheme as programming languages.

We also designed and implemented a system that supports automated information organization
in static and dynamic environments, filtering on a text stream, and persistent queries. The
system has a novel graphical user interface that projects the topic content of the corpus onto a
2-dimensional window, thus supporting intuitive browsing to cope with information overload. The
information  organization  software  is  available  from  http: / /www. cs.dartmouth.edu/  rus/Software/info 
org.tar.gz. 


5  Papers 

The following papers resulted from this project:

“Automatic Information Organization” (with J. Aslam and K. Pelekhov), in Proceedings of
SSGRR 2000 (book with CD ROM).

“Mobile agents: motivations, state of the art, and frontiers” (with G. Cybenko, R. Gray, and D.
Kotz), in J. Bradshaw, ed., Handbook of Agent Technologies, MIT Press, 1999 (to appear).

“Generating, visualizing, and evaluating high-quality clusters for information organization”
(with J. Aslam and K. Pelekhov), in Proceedings of Principles of Digital Document Processing,
eds. E. Munson, C. Nicholas, D. Wood, Lecture Notes in Computer Science 1481,
Springer-Verlag, 1998.

“Applications  of  clustering  to  filtering  and  persistent  queries”  (with  J.  Aslam  and  K.  Pelekhov), 
in  Proceedings  of  CIKM  2000  (November  2000). 

“Scalable Information Organization” (with J. Aslam and F. Reiss), in Proceedings of RIAO
2000  (Content-based  information  access)  (April  2000). 

“A practical clustering algorithm for static and dynamic information organization” (with J.
Aslam and K. Pelekhov), in the 1999 Symposium on Discrete Algorithms (SODA99),
Baltimore, MD (January 1999).

“Static and Dynamic Information Organization with Star Clusters” (with J. Aslam and K.
Pelekhov), in Proceedings of the 1998 Conference on Intelligent Knowledge Management,
Washington, DC (November 1998).

6  Talks 

“Mobile  Information  Agents”,  D.  Rus,  Caterpillar,  Peoria,  IL,  December  1998. 

“Mobile Information Agents”, D. Rus, Rome Labs, December 1998.

“Mobile Information Agents”, D. Rus, Qualcomm distinguished lecture, The University of
California at San Diego, February 1999.

“Mobile Information Agents”, panel on modern information technologies moderated by Joe
Cavano, COMPSAC 99, October 1999.

“Scalable Extensions of the Star Algorithm”, F. Reiss, RIAO 2000, April 2000.

“Information Organization Algorithms”, D. Rus, SSGRR 2000, L’Aquila, Italy, July 2000.

“Using the star clustering algorithm for filtering”, J. Aslam, CIKM 2000, November 2000.

“D’Agents: a secure mobile agent system”, D. Rus, NRL, December 2000.


7  Service  to  the  Community 

•  D.  Rus,  Treasurer,  The  1999  International  Conference  on  Autonomous  Agents 

• D. Rus, Program Committee, The Workshop on Mobile Agents in the Context of Competition
and Cooperation, 1999

• D. Kotz, Program Committee, The Workshop on Mobile Agents in the Context of Competition
and Cooperation, 1999

• D. Rus, Senior Program Committee, 1999 International Joint Conference on Artificial
Intelligence (IJCAI99)

•  D.  Rus,  General  Chair,  Dartmouth  Workshop  on  Mobile  Agents,  1999,  2000 

•  D.  Rus,  Program  Committee  SIGIR 

•  D.  Rus,  Senior  Program  Committee,  2001  International  Conference  on  Autonomous  Agents 

8 Interactions with other Agencies

We are working with DARPA as part of the Co-Abs project and with the Air Force as part of a
MURI project. We have integrated the information organization system we developed as part
of this contract into our MURI demo and hope to do some integration with the DARPA Grid and
perhaps a Fleet Battle Experiment. We are looking for more venues to transition this work.

Static and Dynamic Information Organization
with Star Clusters

Javed  Aslam  Katya  Pelekhov  Daniela  Rus 

Department  of  Computer  Science 
Dartmouth  College 
Hanover,  NH  03755 


Abstract 

In this paper we present a system for static and dynamic
information organization and show our evaluations
of this system on TREC data. We introduce the off-line
and on-line star clustering algorithms for information
organization. Our evaluation experiments show that the off-line
star algorithm outperforms the single link and average
link clustering algorithms. Since the star algorithm is also
highly efficient and simple to implement, we advocate its
use for tasks that require clustering, such as information
organization, browsing, filtering, routing, topic tracking,
and new topic detection.

1  Introduction 

Modern information systems have vast amounts of unorganized
data that change dynamically. Consider, for example,
the flow of information that arrives continuously
on news wires, or is aggregated by a news organization
such as CNN. Some stories are new while other stories are
follow-ups on previous stories. Yet other stories
make previous reports obsolete. The news focus
changes regularly with this flow of information. In such
dynamic systems, users need to locate information quickly
and efficiently.

Current information systems such as Inquery [Tur90],
Smart [Sal91] and AltaVista provide some simple automation
by computing ranked (sorted) lists of documents, but
it is ineffective for users to scan a list of hundreds of document
titles. To cull the critical information out of a large
set of potentially useful dynamic sources, we need methods
for organizing information to highlight the topic content
of a collection and reorganize the data to adapt to
the incoming flow of documents. Such information organization
algorithms would support incremental information
processing tasks such as routing, topic tracking and new
topic detection in a stream of documents.

In this paper, we present a system for the static and
dynamic organization of information and we evaluate the
system on TREC data. We introduce the off-line and on-line
star clustering algorithms for information organization.
We also describe a novel method for visualizing clusters,
by embedding them in the plane so as to capture their
relative difference in content. Our evaluation experiments
show that the off-line star algorithm outperforms the single
link and average link clustering algorithms. Since the
star algorithm is also highly efficient and simple to implement,
we advocate its use for tasks that require clustering,
such as information organization, routing, topic tracking,
and new topic detection.

1.1  Previous  Work 

There has been extensive research on clustering and applications to many domains [HS86, AB84]. For a good overview see [JD88]. For a good overview of using clustering in information retrieval see [Wil88].

The use of clustering in information retrieval was mostly driven by the cluster hypothesis [Rij79] which states that relevant documents tend to be more closely related to each other than to non-relevant documents. Efforts have been made to determine whether the cluster hypothesis is valid. Voorhees [Voo85] discusses a way of evaluating whether the cluster hypothesis holds and shows negative results. Croft [Cro80] describes a method for bottom-up cluster search that could be shown to outperform a full ranking system for the Cranfield collection. The single link method [Cro77] does not provide any guarantees for the topic similarity within a cluster. Jardine and van Rijsbergen [JR71] show some evidence that search results could be improved by clustering. Hearst and Pedersen [HP96] re-examine the cluster hypothesis by focusing on the Scatter/Gather system [CKP93] and conclude that it holds for browsing tasks.

Systems like Scatter/Gather [CKP93] provide a mechanism for user-driven organization of data into a fixed number of clusters, but user feedback is required and the computed clusters do not have accuracy guarantees. Scatter/Gather uses fractionation to compute nearest-neighbor clusters. In a recent paper, Charikar et al. [CCFM97] consider a dynamic clustering algorithm to partition a collection of text documents into a fixed number of clusters. However, since the number of topics in a dynamic information system is not generally known a priori, a fixed number of clusters cannot generate a natural partition of the information.
1.2 Our Work

Our work on clustering presented in this paper and in [APR98] describes a simple incremental algorithm, provides positive evidence for the cluster hypothesis, and shows promise for on-line tasks that require dynamically adjusting the topic content of a collection such as filtering, browsing, new topic detection and topic tracking. We propose an off-line algorithm for clustering static information and an on-line version of this algorithm for clustering dynamic information. These two algorithms compute clusters induced by the natural topic structure of the space. Thus, this work differs from [CKP93, CCFM97] in that we do not impose a fixed number of clusters as a constraint on the solution. As a result, we can guarantee a lower bound on the topic similarity between the documents in each cluster.

To compute accurate clusters, we formalize the clustering problem as one of covering a thresholded similarity graph by cliques. Covering by cliques is NP-complete and thus intractable for large document collections. Recent graph-theoretic results have shown that the problem cannot even be approximated in polynomial time [LY94, Zuc93]. We instead use a cover by dense subgraphs that are star-shaped,¹ where the covering can be computed off-line for static data and on-line for dynamic data. We show that the off-line and on-line algorithms produce high-quality clusters very efficiently. Asymptotically, the running time of both algorithms is roughly linear in the size of the similarity graph that defines the information space. We also derive lower bounds on the topic similarity within clusters guaranteed by a star covering, thus providing theoretical evidence that the clusters produced by a star cover are of high quality. We packaged these algorithms as a system that supports ad-hoc queries, static information organization, dynamic information organization, and routing. In this system we contributed a novel way of visualizing topic clusters by using disks whose radii are proportional to the size of the cluster and that are embedded in the plane in a way that captures the topic distance between the clusters. Finally, we provide experimental data for off-line and on-line topic organization. In particular, our off-line results on a TREC collection indicate that star covers exhibit significant performance improvements over either the single link [Cro77] or average link [Voo85] methods (21.6% and 16.2% improvements, respectively, with respect to a common cluster quality measure) without sacrificing simplicity or efficiency.

1.3  Utility 

Our algorithms for organizing information systems can be used in several ways. The off-line algorithm can be used as a pre-processing step in a static information system or as a post-processing step on the specific documents retrieved by a query. As a pre-processor, this system assists users with deciding how to browse a database of free text documents by highlighting relevant topics and irrelevant subtopics. Such clustered data is useful for narrowing down the database over which detailed queries can be formulated. As a post-processor, this system classifies the retrieved data into clusters that capture topic categories and subcategories. The on-line algorithm can be used for constructing self-organizing information systems, for routing problems, for topic detection, and for topic tracking.

¹ In [SJJ70], stars were also identified as potentially useful for clustering.

2  Off-line  Information  Organization 

In this section, we first present an efficient algorithm for off-line organization of information. We then describe our system built around this algorithm, including user interface design and visualization techniques. Finally, we present a performance evaluation of our organization algorithm. We start by examining the organization problem and introducing the star algorithm.

2.1  The  Star  Algorithm 

We formalize our problem by representing an information system by its similarity graph. A similarity graph is an undirected, weighted graph $G = (V, E, w)$ where vertices in the graph correspond to documents and each weighted edge in the graph corresponds to a measure of similarity between two documents. We measure the similarity between two documents by using the cosine metric in the vector space model of the Smart information retrieval system [Sal89, Sal91].
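As a rough illustration of the similarity graph, the sketch below computes pairwise cosine similarities between documents represented as sparse term-weight dictionaries. The function names and the dictionary representation are assumptions made for this example; Smart's own term weighting and indexing are more elaborate and are not reproduced here.

```python
import math


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc_b.get(term, 0.0) for term, w in doc_a.items())
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: empty documents match nothing
    return dot / (norm_a * norm_b)


def similarity_graph(docs):
    """Return the symmetric weight matrix w[i][j] of the complete graph G."""
    n = len(docs)
    return [[cosine_similarity(docs[i], docs[j]) for j in range(n)]
            for i in range(n)]
```

Each entry `w[i][j]` plays the role of the edge weight between documents `i` and `j`; a dense matrix is used only for brevity and would not scale to large corpora.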

$G$ is a complete graph with edges of varying weight. An organization of the graph that produces reliable clusters of similarity $\sigma$ (i.e., clusters where documents have pairwise similarities of at least $\sigma$) can be obtained by first thresholding the graph at $\sigma$ and then performing a minimum clique cover with maximal cliques on the resulting graph $G_\sigma$. The thresholded graph $G_\sigma$ is an undirected graph obtained from $G$ by eliminating every edge whose weight is lower than $\sigma$. The minimum clique cover has two features. First, by using cliques to cover the similarity graph, we are guaranteed that all the documents in a cluster have the desired degree of similarity. Second, minimal clique covers with maximal cliques allow vertices to belong to several clusters. In our information retrieval application this is a desirable feature as documents can have multiple subthemes. However, the algorithm can also be used to compute non-overlapping clusters. In our experimental evaluations (see Figure 4) we show that the difference in results between star with overlapping clusters and star without overlapping clusters is very small.

Unfortunately, this approach is not tractable computationally. For real corpora, similarity graphs can be very large. The clique cover problem is NP-complete, and it does not admit polynomial-time approximation algorithms [LY94, Zuc93]. While we can neither compute a clique cover nor even approximate such a cover, we can instead cover our graph by dense subgraphs. What we lose in intra-cluster similarity guarantees, we gain in computational efficiency. In this section and the sections that follow, we describe off-line and on-line covering algorithms and analyze their performance and efficiency.

We approximate a clique cover by covering the associated thresholded similarity graph with star-shaped subgraphs. A star-shaped subgraph on $m + 1$ vertices consists of a single star center and $m$ satellite vertices, where there exist edges between the star center and each of the satellite vertices. While finding cliques in the thresholded similarity graph $G_\sigma$ guarantees a pairwise similarity between documents of at least $\sigma$, it would appear at first glance that finding star-shaped subgraphs in $G_\sigma$ would provide similarity guarantees between the star center and each of the satellite vertices, but no such similarity guarantees between satellite vertices. However, by investigating the geometry of our problem in the vector space model, we can derive a lower bound on the similarity between satellite vertices as well as provide a formula for the expected similarity between satellite vertices. The latter formula predicts that the pairwise similarity between satellite vertices in a star-shaped subgraph is high, and together with empirical evidence supporting this formula, we shall conclude that covering $G_\sigma$ with star-shaped subgraphs is a reliable method for clustering a set of documents.

For any threshold $\sigma$:

1. Let $G_\sigma = (V, E_\sigma)$ where $E_\sigma = \{e : w(e) \ge \sigma\}$.

2. Let each vertex in $G_\sigma$ initially be unmarked.

3. Calculate the degree of each vertex $v \in V$.

4. Let the highest degree unmarked vertex be a star center and construct a cluster from the star center and its associated satellite vertices. Mark each node in the newly constructed cluster.

5. Repeat step 4 until all nodes are marked.

6. Represent each cluster by the document corresponding to its associated star center.

Figure 1: The star algorithm

The star algorithm is based on a greedy cover of the thresholded similarity graph by star-shaped subgraphs; the algorithm itself is summarized in Figure 1. The star algorithm is very efficient. In [APR98] we show that the star algorithm can be correctly implemented in such a way that given a thresholded similarity graph $G_\sigma$, the running time of the algorithm is $O(V + E_\sigma)$, linear in the size of the input graph.

2.2 Cluster Quality

In this section, we argue that the clusters produced by a star cover have high average intra-cluster similarity weights; thus, the clusters produced are accurate and of high quality. Consider three documents $C$, $S_1$ and $S_2$ which are vertices in a star-shaped subgraph of $G_\sigma$, where $S_1$ and $S_2$ are satellite vertices and $C$ is the star center. By the definition of a star-shaped subgraph of $G_\sigma$, we must have that the similarity between $C$ and $S_1$ is at least $\sigma$ and that the similarity between $C$ and $S_2$ is also at least $\sigma$. In the vector space model, these similarities are obtained by taking the cosine of the angle between the vectors associated with each document. Let $\alpha_1$ be the angle between $C$ and $S_1$, and let $\alpha_2$ be the angle between $C$ and $S_2$. We then have that $\cos\alpha_1 \ge \sigma$ and $\cos\alpha_2 \ge \sigma$. Note that the angle between $S_1$ and $S_2$ can be at most $\alpha_1 + \alpha_2$, and therefore we have the following lower bound on the similarity between satellite vertices in a star-shaped subgraph of $G_\sigma$.

Theorem 1 Let $G_\sigma$ be a similarity graph and let $S_1$ and $S_2$ be two satellites in the same star in $G_\sigma$. If $\cos\alpha_1 \ge \sigma$ and $\cos\alpha_2 \ge \sigma$ are the respective similarities between $S_1$ and the star center and between $S_2$ and the star center, then the similarity between $S_1$ and $S_2$ must be at least

$$\cos(\alpha_1 + \alpha_2) = \cos\alpha_1 \cos\alpha_2 - \sin\alpha_1 \sin\alpha_2.$$

If $\sigma = 0.7$, $\cos\alpha_1 = 0.75$ and $\cos\alpha_2 = 0.85$, for instance, we can conclude that the similarity between the two satellite vertices must be at least²

$$(0.75)(0.85) - \sqrt{1 - (0.75)^2}\,\sqrt{1 - (0.85)^2} \approx 0.29.$$

While this may not seem very encouraging, the above analysis is based on absolute worst-case assumptions, and in practice, the similarities between satellite vertices are much higher. We further undertook a study to determine the expected similarity between two satellite vertices. Under the assumption that "similar" documents are essentially "random" perturbations of one another in an appropriate vector space, we have proven the following [APR97]:

Theorem 2 Let $G_\sigma$ be a similarity graph and let $S_1$ and $S_2$ be two satellites in the same star in $G_\sigma$. If $\cos\alpha_1 \ge \sigma$ and $\cos\alpha_2 \ge \sigma$ are the respective similarities between $S_1$ and the star center and between $S_2$ and the star center, then the expected similarity between $S_1$ and $S_2$ is

$$\cos\alpha_1 \cos\alpha_2 + \frac{\sigma}{1 + \sigma} \sin\alpha_1 \sin\alpha_2.$$

For the previous example, the above formula would predict a similarity between satellite vertices of approximately 0.78. We have tested this formula against real data, and the results of the test with the MEDLINE data set are shown in Figure 2. In this plot, the x- and y-axes are similarities between a cluster center and each of two satellite vertices, and the z-axis is the actual mean squared prediction error of the above formula for the similarity between satellite vertices. Note that the root mean square error (RMS) is quite small (approximately 0.13 in the worst case), and for reasonably high similarities, the error is negligible. From our tests with real data, we have concluded that this formula is quite accurate and that star-shaped subgraphs are reasonably "dense" in the sense that they imply relatively high pairwise similarities between documents.

Figure 2: This figure shows the actual mean-squared prediction error for a 6,000 abstract subset of MEDLINE. (Plot not reproduced; vertical axis: mean square error.)
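The two numerical values quoted in this section (about 0.29 for the worst case and about 0.78 for the expected case) can be checked with a few lines of code. The sketch below simply evaluates the expressions of Theorems 1 and 2; the function name is our own choice for illustration.

```python
import math


def satellite_similarity_bounds(cos1, cos2, sigma):
    """Return (worst-case lower bound, expected similarity) for two satellites.

    cos1 and cos2 are the similarities of each satellite to the star center;
    sigma is the threshold used to build the thresholded similarity graph.
    """
    sin1 = math.sqrt(1.0 - cos1 * cos1)
    sin2 = math.sqrt(1.0 - cos2 * cos2)
    lower = cos1 * cos2 - sin1 * sin2                             # Theorem 1
    expected = cos1 * cos2 + sigma / (1.0 + sigma) * sin1 * sin2  # Theorem 2
    return lower, expected


lower, expected = satellite_similarity_bounds(0.75, 0.85, 0.7)
print(f"{lower:.2f} {expected:.2f}")  # prints "0.29 0.78"
```

Note that the threshold enters only the expected-similarity formula; the worst-case bound depends on the two center-to-satellite similarities alone.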


² Note that $\sin\theta = \sqrt{1 - \cos^2\theta}$.




Figure 3: This is a screen snapshot from a clustering experiment. The top window is the query window. The middle window consists of a ranked list of documents that were retrieved in response to the user query. The user may select "get" to fetch a document or "graph" to request a graphical visualization of the clusters as in the bottom window. The left graph displays all the documents as dots around a circle. Clusters are separated by gaps. The edges denote pairs of documents whose similarity falls between the slider parameters. The right graph displays all the clusters as disks. The radius of a disk is proportional to the size of the cluster. The distance between the disks is proportional to the similarity distance between the clusters.


2.3  The  System 

We have implemented a system for organizing information that uses the star algorithm. This organization system was used for the experiments described in this paper. It consists of an augmented version of the Smart system [Sal91, All95], a user interface we have designed, and an implementation of the star algorithm on top of Smart. To index the documents we used the Smart search engine with a cosine normalization weighting scheme. We enhanced Smart to compute a document-to-document similarity matrix for a set of retrieved documents or a whole collection. The similarity matrix is used to compute clusters and to visualize the clusters. The user interface is implemented in Tcl/Tk.

The organization system can be run on a whole collection, on a specified subcollection, or on the collection of documents retrieved in response to a user query. Users can input queries by typing free text. They have the choice of specifying several corpora. This system supports distributed information retrieval, but in this paper we do not focus on distribution and we assume only one centrally located corpus. In response to a user query, Smart is invoked to produce a ranked list of the most relevant documents, their titles, locations and document-to-document similarity information. The similarity information for the entire collection, or for the collection computed by the query engine, is provided as input to the star algorithm. This algorithm returns a list of clusters and marks their centers.
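To make the interface between the similarity information and the clustering step concrete, here is a minimal sketch of the off-line star algorithm operating on a dense similarity matrix. It is not the implementation built on Smart: the function name and the matrix representation are assumptions, and because it sorts vertices and scans a dense matrix it does not achieve the linear running time reported in [APR98].

```python
def star_clusters(sim, sigma):
    """Greedy star cover of the similarity graph thresholded at sigma.

    sim is a symmetric matrix (list of lists) of pairwise similarities.
    Returns a dict mapping each star center to the set of vertices in its
    cluster (the center plus its satellites). Clusters may overlap.
    """
    n = len(sim)
    # Adjacency lists of the thresholded graph: keep edges with weight >= sigma.
    adj = {v: {u for u in range(n) if u != v and sim[v][u] >= sigma}
           for v in range(n)}
    unmarked = set(range(n))
    clusters = {}
    # Degrees are fixed, so visiting vertices in descending degree order and
    # skipping marked ones is the same as repeatedly taking the highest-degree
    # unmarked vertex. Ties are broken by vertex index (an arbitrary choice).
    for v in sorted(range(n), key=lambda x: -len(adj[x])):
        if v in unmarked:
            clusters[v] = {v} | adj[v]
            unmarked -= clusters[v]
    return clusters
```

For a five-document matrix in which documents 0, 1, 2 are mutually related, documents 3 and 4 are related, and document 2 is also similar to document 3, a threshold of 0.7 yields two stars that share document 2, which illustrates how overlapping clusters arise.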

2.4  Visualization 

We have developed a visualization method for organized data that presents users with three views of the data (see Figure 3): a list of text titles, a graph that shows the similarity relationship between the documents, and a graph that shows the similarity relationship between the clusters. These views provide users with summaries of the data at different levels of detail (text, document and topic) and facilitate browsing by topic structure.

The connected graph view (inspired by [All95]) has nodes corresponding to the retrieved documents. The nodes are placed in a circle, with nodes corresponding to the same cluster placed together. Gaps between the nodes allow us to identify clusters easily. Edges between nodes are color coded according to the similarity between the documents. Two slider bars allow the user to establish minimal and maximal weight of edges to be shown.

Another view presents clusters as disks whose size is proportional to the size of the corresponding cluster. The distance between two clusters is defined as a distance between the central documents and captures the topic separation between the clusters. Simulated annealing is used to find a cluster placement that minimizes the sum of relative distance errors between clusters. We selected a cooling schedule $a(t) = t/(1 + \beta t)$, where $\beta = 10^{-3}$, the initial temperature is 500 and the freezing point is $10^{-2}$. This setting provides a good placement when the number of clusters returned by the algorithm is small. This algorithm is fast, and its running time does not depend on the number of clusters. When the number of clusters is large, the ellipsoid-based method for Euclidean graph embeddings described in [LLR95] can be used instead.
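The following sketch shows one way such an annealing placement could look. Only the cooling schedule and its three constants come from the text; everything else is an assumption made for illustration: that the schedule maps the current temperature to the next one, that a move perturbs one disk center at random, that moves are accepted by the usual Metropolis rule, and that the objective is the sum of relative errors between planar and target distances.

```python
import math
import random


def cooling_schedule(t0=500.0, beta=1e-3, freeze=1e-2):
    """Yield temperatures t, t/(1 + beta*t), ... until the freezing point."""
    t = t0
    while t > freeze:
        yield t
        t = t / (1.0 + beta * t)


def layout_error(pos, target):
    """Sum of relative errors between planar and target inter-cluster distances."""
    err = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            err += abs(math.dist(pos[i], pos[j]) - target[i][j]) / target[i][j]
    return err


def anneal_layout(target, seed=0, step=0.1, **schedule):
    """Place one point per cluster in the plane (hypothetical move and accept rules)."""
    rng = random.Random(seed)
    n = len(target)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    cur = layout_error(pos, target)
    best, best_err = list(pos), cur
    for t in cooling_schedule(**schedule):
        i = rng.randrange(n)
        old = pos[i]
        pos[i] = (old[0] + rng.uniform(-step, step),
                  old[1] + rng.uniform(-step, step))
        new = layout_error(pos, target)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_err:
                best, best_err = list(pos), cur
        else:
            pos[i] = old
    return best, best_err
```

Under this reading of the schedule, $1/t$ grows by $\beta$ at every step, so going from 500 down to $10^{-2}$ takes roughly 100,000 steps regardless of the input, which is consistent with the remark that the running time does not depend on the number of clusters. In this sketch the cost of each step does grow with the number of clusters, because the full error is recomputed.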

All three views and a title window allow the user to select an individual document or a cluster. A selection made in one window is simultaneously reflected in the others.

2.5 Performance Comparison with Two Clustering Algorithms

In order to evaluate the performance of our system, we tested the star algorithm against two classic clustering algorithms: the single link method [Cro77] and the average link method [Voo85]. We used data from the TREC-6 conference as our testing medium. The TREC collection contains a set of 130,471 documents of which 21,694 have been ascribed relevance data with respect to 47 topics. These 21,694 documents were partitioned into 22 separate subcollections of approximately 1,000 documents each. Within a subcollection, each of the 47 topics has a corresponding subset of documents which is relevant to that topic.


The goal of a clustering method is to organize the set of documents in such a way that the subset of documents corresponding to a selected topic appears as a cluster in the organization. For each of the subcollections, we performed the following experiment. Given a selected topic, the set of documents was organized by a clustering method in question, and the "best" cluster corresponding to this topic was returned. Two issues immediately arise: first, how does one measure the "quality" of a cluster to determine which is "best"; and second, how does one appropriately generate clusters from which to choose? To measure the quality of a cluster, we use the common E measure [Rij79] as defined below

$$E(p, r) = 1 - \frac{2}{1/p + 1/r}$$

where $p$ and $r$ are the standard precision and recall of the cluster with respect to the set of documents relevant to the topic. Note that $E(p, r)$ is simply one minus the harmonic mean of the precision and recall; thus, $E(p, r)$ ranges from 0 to 1 where $E(p, r) = 0$ corresponds to perfect precision and recall and $E(p, r) = 1$ corresponds to zero precision and recall. It is worthwhile to note that when viewing data comparing two clustering methods, lower $E(p, r)$ values correspond to better performance. In order to compare the clustering methods fairly, each of the methods was run in such a way as to produce the "best" possible cluster with respect to a given topic, as defined by the $E(p, r)$ measure above. (This is in keeping with previous comparative analyses of clustering methods; see, for example, Burgin [Bur95] and Shaw [Sha93].) In the case of the single link and star cover algorithms, the algorithms were run using a range of thresholds, and the best cluster obtained over all thresholds was returned. (One can view the clustering obtained with respect to a given threshold as a "slice" within a hierarchical clustering over all thresholds; thus, in effect, the best cluster in the hierarchy was returned in these experiments.) In the case of the average-link algorithm which naturally produces a hierarchical clustering, the best cluster within the hierarchy was returned.
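The evaluation procedure can be sketched as follows. The code computes precision, recall and the E measure of a candidate cluster against the relevant set for a topic, and selects the candidate with the lowest E value. The function names, and the convention of returning 1 when precision or recall is zero (where the harmonic mean is otherwise undefined), are assumptions made for this example.

```python
def e_measure(precision, recall):
    """E(p, r) = 1 - 2 / (1/p + 1/r); lower values are better."""
    if precision == 0.0 or recall == 0.0:
        return 1.0  # convention for this sketch: the harmonic mean is taken as 0
    return 1.0 - 2.0 / (1.0 / precision + 1.0 / recall)


def cluster_quality(cluster, relevant):
    """Return (precision, recall, E) of a cluster for a topic's relevant set."""
    cluster, relevant = set(cluster), set(relevant)
    hits = len(cluster & relevant)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall, e_measure(precision, recall)


def best_cluster(clusters, relevant):
    """Pick the candidate cluster with the lowest E value for the topic."""
    return min(clusters, key=lambda c: cluster_quality(c, relevant)[2])
```

In the experiments described above, the candidates passed to `best_cluster` would be all clusters produced over the range of thresholds (or all clusters in the hierarchy, for average link).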

Unlike the star algorithm, single and average link algorithms do not allow overlapping clusters. It has been suggested [All98] that the differences in performance may be attributed to the effects of overlapping rather than to the actual properties of the algorithm. To investigate this issue we conducted the same experiments using a version of the star clustering algorithm that eliminates the overlapping clusters. In this setting we used the star algorithm to find a set of star centers, then partitioned a collection by assigning a document to the closest star center. This methodology has been used before [JD88]. We note that the difference in results between star with overlapping clusters and star without overlapping clusters is very small. Both algorithms still outperform single link and average link (see Figure 4).
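The partitioning step can be sketched in a few lines. Given a similarity matrix and a list of star centers (obtained from any run of the star algorithm), each document is assigned to the center to which it is most similar. The function name and the tie-breaking rule (the first center in the list wins) are assumptions for this example.

```python
def partition_by_centers(sim, centers):
    """Assign every document to its most similar star center.

    sim is a symmetric similarity matrix; centers is a list of vertex indices.
    Returns a dict mapping each center to the list of documents assigned to it,
    so the resulting clusters do not overlap.
    """
    parts = {c: [] for c in centers}
    for d in range(len(sim)):
        # A center always stays in its own cluster; other documents go to the
        # center with the highest similarity (ties: first center listed).
        nearest = d if d in parts else max(centers, key=lambda c: sim[d][c])
        parts[nearest].append(d)
    return parts
```

A document that was a satellite of two stars in the overlapping cover ends up in exactly one cluster here, which is the only difference between the two variants compared in Figure 4.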

Figure 4: This figure shows comparison data for the star algorithm, the partitioning star algorithm, the single link algorithm, and the average link algorithm for 22 subcollections of TREC documents. For each algorithm, $p$ represents the average precision computed across all clusters found for the collection; $r$ represents the average recall computed across all clusters found for the collection; and $E(p, r)$ is the aggregate measure $1 - \frac{2}{1/p + 1/r}$.

Figure 5: This figure shows the $E(p, r)$ measure for the partitioning star clustering algorithm and for the single link clustering algorithm. The y axis shows the $E(p, r)$ measure, while the x axis shows the cluster number. Clusters have been sorted according to the $E(p, r)$ values of the star algorithm.

Each subcollection of 1,000 documents corresponded to an individual experiment. For a given clustering method, the appropriate algorithm was employed to determine the best possible cluster (as defined by the $E(p, r)$ measure) for each of the 47 topics. For each optimal cluster, the $E(p, r)$, precision and recall values were calculated with respect to the actual set of documents relevant to the topic, and these values were averaged over all topics to obtain the three numbers reported for each experiment and clustering method in Figure 4. Averaging over all 22 experiments, we find that the mean $E(p, r)$ values for star, partitioning star, average link and single link are 0.37, 0.39, 0.43 and 0.45, respectively. Thus, the star algorithm represents a 16.2% improvement in performance with respect to average link and a 21.6% improvement with respect to single link. The difference is only partly due to the effect of allowing overlapping clusters: the partitioning star algorithm still gives us a 10.2% and 15.4% improvement in performance over average link and single link respectively.
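The reported percentages appear to be consistent with measuring the gap in mean E value relative to the star algorithm's own mean E value. That reading is our assumption; the short check below applies it to the rounded means quoted above for the overlapping star algorithm.

```python
def relative_improvement(e_star, e_other):
    """Percentage by which e_other exceeds e_star, relative to e_star."""
    return 100.0 * (e_other - e_star) / e_star


# Mean E values quoted in the text (rounded to two decimals).
mean_e = {"star": 0.37, "average link": 0.43, "single link": 0.45}

for method in ("average link", "single link"):
    gain = relative_improvement(mean_e["star"], mean_e[method])
    print(method, round(gain, 1))
# prints "average link 16.2" and "single link 21.6"
```

Because the inputs here are rounded means, applying the same formula to the partitioning star value can differ from the reported figure in the last digit.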
Figure 6: This figure shows the $E(p, r)$ measure for the partitioning star clustering algorithm and for the average link clustering algorithm. The y axis shows the $E(p, r)$ measure, while the x axis shows the cluster number. Clusters have been sorted according to the $E(p, r)$ values of the star algorithm.


We repeated this experiment on the same data, using one collection only (of 21,694 documents). The precision, recall, and E values for star (overlap), star, average link, and single link were (.52, .36, .58), (.53, .32, .61), (.63, .25, .64), and (.66, .20, .70) respectively. We note that the E measures are worse for all four algorithms on this larger collection and that the star algorithm outperforms average link by 10.3% and single link by 20.7%.

Figures 5 and 6 show detailed $E(p, r)$ values for the star algorithm vs. the single link algorithm and for the star algorithm vs. the average link algorithm over the collection of experiments. Each cluster computed by the algorithm has an $E(p, r)$ value. For better readability of these graphs, we sorted the clusters produced by the star algorithm according to their $E(p, r)$ values. We plotted the corresponding $E(p, r)$ values for the single link algorithm (see the oscillating line in Figure 5) and for the average link algorithm (see the oscillating line in Figure 6). We note that the $E(p, r)$ values for the star clusters are almost everywhere lower than the corresponding values for the single link and average link algorithms; thus, the star algorithm outperforms these two methods.

These experiments show that the star algorithm outperforms the single link and average link methods. Since the star algorithm is also simple to implement and highly efficient, we believe that the star algorithm is very effective for information organization and other text clustering applications.

3  On-line  Information  Organization 


Figure  7:  This  figure  shows  the  star  cover  change  after  the 
insertion  of  a  new  vertex.  The  larger-radius  disks  denote 
star  centers,  the  other  disks  denote  satellite  vertices.  The 
star  edges  are  denoted  by  solid  lines.  The  inter-satellite 
edges  are  denoted  by  dotted  lines.  The  top  figure  shows  an 
initial  graph  and  its  star  cover.  The  middle  figure  shows 
the  graph  after  the  insertion  of  a  new  document.  The 
bottom  figure  shows  the  star  cover  of  the  new  graph. 

In this section we consider algorithms for computing
the organization of a dynamic information system. We
derive an on-line version of the star algorithm for information
organization that can incrementally compute clusters
of similar documents. We continue assuming the vector
space model and the cosine metric to capture the pairwise
similarity between the documents of the corpus.

3.1  The  On-line  Star  Algorithm 

We assume that documents are inserted or deleted from
the collection one at a time. For simplicity, we will focus
our discussion on adding documents to the collection. The
delete algorithm is similar. The intuition behind the
incremental computation of the star cover of a graph after
a new vertex is inserted is depicted in Figure 7. The top
figure denotes a thresholded similarity graph and a correct
star cover for this graph. Suppose a new vertex is inserted
in the graph, as in the middle figure. The original star
cover is no longer correct for the new graph. The bottom
figure shows the correct star cover for the new graph. How
does the addition of this new vertex affect the correctness
of the star cover? In general, the answer depends on the
degree of the new vertex and on its adjacency list. If the
adjacency list of the new vertex does not contain any star
centers, the new vertex can be added in the star cover as
a star center. If the adjacency list of the new vertex contains
any center vertex c whose degree is higher, the new
vertex becomes a satellite vertex of c. The difficult case
that destroys the correctness of the star cover is when the
new vertex is adjacent to a collection of star centers, each
of whose degree is lower than that of the new vertex. In
this situation, the star structure already in place has to be
modified to assign the new vertex as a star center. The
satellite vertices in the stars that are broken as a result
have to be re-evaluated.

Motivated by the intuition in the previous paragraph,
we now describe an on-line algorithm for incrementally
computing star covers of dynamic graphs. The algorithm
is shown in Figure 8. This algorithm uses a special data
structure to efficiently maintain the star cover of an undirected
graph $G = (V, E)$. For each vertex $v \in V$, we
maintain the following data.


v.type      satellite or center
v.degree    degree of v
v.adj       list of adjacent vertices
v.centers   list of adjacent centers
v.inQ       flag specifying if v is being processed


Note that while v.type can be inferred from v.centers
and v.degree can be inferred from v.adj, it will be convenient
to have all five pieces of data in the algorithm. Let
$\alpha$ be a vertex to be added to G, and let L be the list of
vertices in G which are adjacent to $\alpha$. The algorithm in
Figure 8 will appropriately update the star cover of G. See
[APR97] for a more detailed correctness argument.

3.2  Analysis 

We have shown that the star cover produced by the on-line
star algorithm is correct in that it is identical to the
star cover produced by the off-line algorithm (or one of the
correct covers, if more than one exists) [APR97]. Furthermore,
the on-line star algorithm is very efficient. In our initial
tests, we have implemented the on-line star algorithm
using a heap for the priority queue and simple linked lists
for the various lists required. The time required to insert
a new vertex and associated edges into a thresholded similarity
graph and to appropriately update the star cover is
largely governed by the number of stars that are broken
during the update, since breaking stars requires inserting
new elements into the priority queue. In practice, very few
stars are broken during any given update (see Figure 9).
This is due partly to the fact that relatively few stars exist
at any given time (as compared to the number of vertices
or edges in the thresholded similarity graph) and partly to
the fact that the likelihood of breaking any individual star
is also small [APR97].


Update(α, L)

  α.type ← satellite
  α.degree ← 0
  α.adj ← ∅
  α.centers ← ∅
  forall β in L
    α.degree ← α.degree + 1
    β.degree ← β.degree + 1
    Insert(β, α.adj)
    Insert(α, β.adj)
    if (β.type = center)
      Insert(β, α.centers)
    else
      β.inQ ← true
      Enqueue(β, Q)
    endif
  endfor
  α.inQ ← true
  Enqueue(α, Q)
  while (Q ≠ ∅)
    φ ← ExtractMax(Q)
    if (φ.centers = ∅)
      φ.type ← center
      forall β in φ.adj
        Insert(φ, β.centers)
      endfor
    else
      if (∀δ ∈ φ.centers, δ.degree < φ.degree)
        φ.type ← center
        forall β in φ.adj
          Insert(φ, β.centers)
        endfor
        forall δ in φ.centers
          δ.type ← satellite
          forall μ in δ.adj
            Delete(δ, μ.centers)
            if (μ.inQ = false)
              μ.inQ ← true
              Enqueue(μ, Q)
            endif
          endfor
        endfor
      endif
    endif
    φ.inQ ← false
  endwhile


Figure  8:  The  on-line  star  algorithm  for  clustering. 
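
The pseudocode of Figure 8 can be sketched in Python roughly as follows. This is only an illustrative reading of the figure, not the implementation used in the experiments: the class name `Vertex`, the function name `update`, the use of sets instead of linked lists, the derivation of the degree from the adjacency set, and the use of `heapq` with negated degrees as the max-priority queue are all assumptions made for this sketch. The two branches of the figure that promote a vertex to a center are merged into a single test.

```python
import heapq
from itertools import count


class Vertex:
    """Per-vertex record; field names follow the table in the text."""

    def __init__(self, name):
        self.name = name
        self.type = "satellite"
        self.adj = set()       # adjacent vertices
        self.centers = set()   # adjacent star centers
        self.in_q = False      # True while waiting in the priority queue

    @property
    def degree(self):
        # Derived from adj in this sketch instead of being stored.
        return len(self.adj)


def update(alpha, neighbours):
    """Insert vertex alpha, adjacent to `neighbours`, and repair the cover."""
    heap, tie = [], count()

    def enqueue(v):
        v.in_q = True
        # heapq is a min-heap, so the degree is negated to extract the max;
        # the counter breaks ties so vertices are never compared directly.
        heapq.heappush(heap, (-v.degree, next(tie), v))

    for beta in neighbours:
        alpha.adj.add(beta)
        beta.adj.add(alpha)
        if beta.type == "center":
            alpha.centers.add(beta)
        else:
            enqueue(beta)
    enqueue(alpha)

    while heap:
        _, _, phi = heapq.heappop(heap)
        # all() over an empty set is True, so this test also covers the
        # case in which phi has no adjacent center at all.
        if all(delta.degree < phi.degree for delta in phi.centers):
            broken = list(phi.centers)   # copy: the set is modified below
            phi.type = "center"
            for beta in phi.adj:
                beta.centers.add(phi)
            for delta in broken:
                delta.type = "satellite"
                for mu in delta.adj:
                    mu.centers.discard(delta)
                    if not mu.in_q:
                        enqueue(mu)
        phi.in_q = False
```

A collection is built by creating one `Vertex` per document and calling `update(v, L)` with the list `L` of already inserted vertices whose similarity to `v` meets the threshold.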

We evaluated the on-line star cover algorithm on a
2224-document corpus consisting of a judged subcollection of
TREC documents augmented with our department’s technical
reports. We ran 4 experiments. Each time we selected
a different threshold and proceeded to insert the
2224 documents in random order, using the on-line star
cluster algorithm. The results of these experiments were
averaged. The running time measurements appear to be
linear in the number of edges of the similarity graph.
Figures 9 and 10 show the experimental data. Note that the
number of broken stars is roughly linear in the number of
vertices and that the running time is linear in the number
of edges in the graph, although we can see the effects of
lower order terms.


Figure 9: The dependence of the number of broken stars on
the number of vertices for TREC data.


[Plot omitted; x-axis: number of edges, from 0.0e+00 to 1.0e+06]


Figure  10:  This  figure  shows  the  dependence  of  the  running 
time  of  the  on-line  star  algorithm  on  the  number  of  edges 
in  a  TREC  subcollection. 

4  Discussion 

We have presented, analyzed, and evaluated the star
clustering algorithm for information organization. We
described an off-line version of this algorithm that can be
used to organize static information in accurate clusters
efficiently. We also described an on-line version of the
algorithm that can be used to organize dynamic data for
tasks that require incremental updates in the topic structure
of the corpus, such as the routing task, the new topic
detection task, and the topic tracking task.

Our implementation of this algorithm contributes a novel
visualization method for clusters that presents users with
disks whose radii correspond to the cluster size and that
are embedded in the plane so as to capture the topic
distance between the clusters.

We evaluated the star algorithm by comparing it against
the single link and the average link algorithms in several
experiments with TREC data. We found that the star
algorithm outperforms the single link algorithm and the
average link algorithm. Since the star algorithm is faster and
easier to implement than the average link algorithm,
we advocate its use. The on-line algorithm produces the
same clustering as the off-line algorithm. Thus, our
evaluation of the off-line star algorithm also suggests using the
on-line star algorithm for tasks that require computing the
topic structure incrementally and adaptively.

Our  findings  so  far  suggest  using  the  star  algorithm  for  a 
variety  of  tasks.  We  are  currently  conducting  experiments 
using  the  on-line  star  algorithm  for  new  topic  detection  and 
topic  tracking.  Because  of  its  cluster  quality,  efficiency,  and 
incremental  properties,  we  believe  this  algorithm  will  lead 
to  improved  results  in  solving  these  tasks. 

Abstract

We present and analyze the off-line star algorithm for
clustering static information systems and the on-line star
algorithm for clustering dynamic information systems. These
algorithms organize a document collection into a number of
clusters that is naturally induced by the collection via a
computationally efficient cover by dense subgraphs. We further
show a lower bound on the accuracy of the clusters produced
by these algorithms as well as demonstrate that these
algorithms are efficient (running times roughly linear in the size
of the problem). Finally, we provide data from a number of
experiments.

1  Introduction 

We wish to create more versatile information capture
and access systems for digital libraries by using
information organization: thousands of electronic documents
will be organized automatically as a hierarchy of topics
and subtopics, using algorithms grounded in geometry,
probabilities, and statistics. Off-line information
organization algorithms will be useful for organizing static
collections (for example, large-scale legacy data).
Incremental, on-line information organization algorithms
will be useful to keep dynamic corpora, such as news
feeds, organized. Current information systems such as
Inquery [Tur90], Smart [Sal91], or Alta Vista provide
some simple automation by computing ranked (sorted)
lists of documents, but it is ineffective for users to scan
a list of hundreds of document titles. To cull the relevant
information out of a large set of potentially useful
dynamic sources we need methods for organizing and
reorganizing dynamic information as accurate clusters,
and ways of presenting users with the topic summaries
at various levels of detail.

There has been extensive research on clustering and
applications to many domains [HS86, AB84]. For a
good overview see [JD88]. For a good overview of using
clustering in Information Retrieval (IR) see [Wil88].
The use of clustering in IR was mostly driven by the
cluster hypothesis [Rij79] which states that relevant
documents tend to be more closely related to each
other than to non-relevant documents. Jardine and
van Rijsbergen [JR71] show some evidence that search
results could be improved by clustering. Hearst and
Pedersen [HP96] re-examine the cluster hypothesis by
focusing on the Scatter/Gather system [CKP93] and
conclude that it holds for browsing tasks.

* Research partially supported by ONR contract N00014-95-1-1204,
Rome Labs contract F30602-98-C-0006, and Air Force
MURI contract F49620-97-1-0382.

Systems like Scatter/Gather [CKP93] provide a
mechanism for user-driven organization of data in a
fixed number of clusters, but the users need to be
in the loop and the computed clusters do not have
accuracy guarantees. Scatter/Gather uses fractionation
to compute nearest-neighbor clusters. Charikar
et al. [CCFM97] consider a dynamic clustering algorithm
to partition a collection of text documents into a
fixed number of clusters. Since in dynamic information
systems the number of topics is not known a priori, a
fixed number of clusters cannot generate a natural
partition of the information.

Our  work  on  clustering  presented  in  this  paper 
and  in  [APR97]  provides  positive  evidence  for  the 
cluster  hypothesis.  We  propose  an  off-line  algorithm 
for  clustering  static  information  and  an  on-line  version 
of  this  algorithm  for  clustering  dynamic  information. 
These  two  algorithms  compute  clusters  induced  by  the 
natural  topic  structure  of  the  space.  Thus,  this  work 
is  different  than  [CKP93,  CCFM97]  in  that  we  do 
not  impose  the  constraint  to  use  a  fixed  number  of 
clusters.  As  a  result,  we  can  guarantee  a  lower  bound 
on  the  topic  similarity  between  the  documents  in  each 
cluster.  The  model  for  topic  similarity  is  the  standard 
vector  space  model  used  in  the  information  retrieval 
community  [Sal89]  which  is  explained  in  more  detail  in 
this  paper  in  Section  2. 

To compute accurate clusters, we formalize clustering
as covering graphs by cliques. Covering by cliques is
NP-complete, and thus intractable for large document
collections. Unfortunately, it has also been shown that
the problem cannot even be approximated in polynomial
time [LY94, Zuc93]. We instead use a cover by dense
subgraphs that are star-shaped and that can be computed
off-line for static data and on-line for dynamic
data. We show that the off-line and on-line algorithms
produce correct clusters efficiently. Asymptotically, the
running time of both algorithms is roughly linear in the
size of the similarity graph that defines the information
space (explained in detail in Section 2). We also show
lower bounds on the topic similarity within the computed
clusters (a measure of the accuracy of our clustering
algorithm) as well as provide experimental data.

Finally, we compare the performance of the star
algorithm to two widely used algorithms for clustering
in IR and other settings: the single link method¹ [Cro77]
and the average link algorithm² [Voo85]. Neither
algorithm provides guarantees for the topic similarity
within a cluster. The single link algorithm can be
used in off-line and on-line mode, and it is faster than
the average link algorithm, but it produces poorer
clusters than the average link algorithm. The average
link algorithm can only be used off-line to process
static data. The star clustering algorithm, on the
other hand, computes topic clusters that are naturally
induced by the collection, provides guarantees on cluster
quality, computes more accurate clusters than either
the single link or average link methods, is efficient,
admits an efficient and simple on-line version, and can
perform hierarchical data organization. We describe
experiments in this paper with the TREC³ database
demonstrating these abilities.

Our algorithms for organizing information systems
can be used in several ways. The off-line algorithm can
be used as a pre-processing step in a static information
system or as a post-processing step on the specific
documents retrieved by a query. As a pre-processor,
this system assists users with deciding how to browse
a database of free text documents by highlighting relevant
topics and irrelevant subtopics. Such clustered
data is useful for narrowing down the database over
which detailed queries can be formulated. As a post-processor,
this system classifies the retrieved data into
clusters that capture topic categories and subcategories.
The on-line algorithm can be used as a basis for constructing
self-organizing information systems. As the
content of dynamic information systems changes, the
on-line algorithm can efficiently automate the process
of organization and re-organization to compute accurate
topic summaries at various levels of similarity.

1 In the single link clustering algorithm a document is part of
a cluster if it is “related” to at least one document in the cluster.

2 In the average link clustering algorithm a document is part
of a cluster if it is “related” to an average number of documents
in the cluster.

2  Clustering static data with star-shaped subgraphs

In  this  section  we  motivate  and  present  an  off-line 
algorithm  for  organizing  information  systems.  The 
algorithm  is  very  simple  and  efficient,  and  it  computes 
high-density  clusters. 

We formulate our problem by representing an information
system by its similarity graph. A similarity
graph is an undirected, weighted graph $G = (V, E, w)$
where vertices in the graph correspond to documents
and each weighted edge in the graph corresponds to a
measure of similarity between two documents. We measure
the similarity between two documents by using the
cosine metric in the vector space model of the Smart
information retrieval system [Sal91, Sal89].

The  vector  space  model  for  textual  information 
aggregates  statistics  on  the  occurrence  of  words  in 
documents.  The  premise  of  the  vector  space  model  is 
that  two  documents  are  similar  if  they  use  the  same 
words.  A  vector  space  can  be  created  for  a  collection 
(or  corpus)  of  documents  by  associating  each  important 
word  in  the  corpus  with  one  dimension  in  the  space.  The 
result  is  a  high  dimensional  vector  space.  Documents 
are  mapped  as  points  in  this  space  according  to  their 
word  frequencies.  Similar  documents  map  to  nearby 
points.  In  the  vector  space  model,  document  similarity 
is  measured  by  the  angle  between  the  corresponding 
document  vectors.  The  standard  in  the  information 
retrieval  community  is  to  map  the  angles  to  the  interval 
[0, 1]  by  taking  the  cosine  of  the  vector  angles. 
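
As a rough illustration of the cosine metric described above, the following sketch computes the similarity of two documents from raw term counts. The function name `cosine_similarity`, the whitespace tokenization, and the use of unweighted term frequencies are assumptions made for this example; the Smart system applies its own term selection and weighting, which are not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only words shared by both documents contribute to the dot product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because term counts are non-negative, the result always falls in the interval [0, 1], with 1 for documents that use the same words in the same proportions and 0 for documents that share no words.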

$G$ is a complete graph with edges of varying weight.
An organization of the graph that produces reliable
clusters of similarity $\sigma$ (i.e., clusters where documents
have pairwise similarities of at least $\sigma$) can be obtained
by (1) thresholding the graph at $\sigma$ and (2) performing
a minimum clique cover with maximal cliques on the
resulting graph $G_\sigma$. The thresholded graph $G_\sigma$ is an
undirected graph obtained from $G$ by eliminating all
the edges whose weights are lower than $\sigma$. The minimum
clique cover has two features. First, by using cliques to
cover the similarity graph, we are guaranteed that all
the documents in a cluster have the desired degree of
similarity. Second, minimum clique covers with maximal
cliques allow vertices to belong to several clusters. In
our information retrieval application this is a desirable
feature as documents can have multiple subthemes.
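
The thresholding step can be sketched as below. The function name `threshold_graph` and the representation of the weighted graph as a dictionary from vertex pairs to similarities are assumptions chosen for this illustration; only the rule itself (keep an edge when its weight is at least the threshold) comes from the text.

```python
def threshold_graph(weights, sigma):
    """Build adjacency sets of the thresholded graph.

    weights: dict mapping a pair (u, v) to the similarity of u and v.
    sigma:   similarity threshold; edges with weight below it are dropped.
    """
    adj = {}
    for (u, v), w in weights.items():
        # Every document stays in the graph, even if it loses all edges.
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if w >= sigma:
            adj[u].add(v)
            adj[v].add(u)
    return adj
```

For example, with weights `{("a", "b"): 0.8, ("a", "c"): 0.3, ("b", "c"): 0.5}` and a threshold of 0.5, only the edge between `a` and `c` is removed.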

Unfortunately, this approach is computationally
intractable. For real corpora, similarity graphs can be
very large. The clique cover problem is NP-complete,
and it does not admit polynomial-time approximation
algorithms [LY94, Zuc93]. While we cannot perform
a clique cover nor even approximate such a cover, we
can instead cover our graph by dense subgraphs. What
we lose in intra-cluster similarity guarantees, we gain
in computational efficiency. In the sections that follow,
we describe off-line and on-line covering algorithms and
analyze their performance and efficiency.

2.1  Dense  Star-Shaped  Covers 

We approximate a clique cover by covering the associated
thresholded similarity graph with star-shaped subgraphs.
A star-shaped subgraph on $m + 1$ vertices consists
of a single star center and $m$ satellite vertices,
where there exist edges between the star center and each
of the satellite vertices. While finding cliques in the
thresholded similarity graph $G_\sigma$ guarantees a pairwise
similarity between documents of at least $\sigma$, it would appear
at first glance that finding star-shaped subgraphs
in $G_\sigma$ would provide similarity guarantees between the
star center and each of the satellite vertices, but no such
similarity guarantees between satellite vertices. However,
by investigating the geometry of our problem in
the vector space model, we can derive a lower bound
on the similarity between satellite vertices as well as
provide a formula for the expected similarity between
satellite vertices. The latter formula predicts that the
pairwise similarity between satellite vertices in a star-shaped
subgraph is high, and together with empirical
evidence supporting this formula, we shall conclude that
covering $G_\sigma$ with star-shaped subgraphs is an accurate
method for clustering a set of documents.

Consider three documents $C$, $S_1$ and $S_2$ which are
vertices in a star-shaped subgraph of $G_\sigma$, where $S_1$ and
$S_2$ are satellite vertices and $C$ is the star center. By
the definition of a star-shaped subgraph of $G_\sigma$, we must
have that the similarity between $C$ and $S_1$ is at least $\sigma$
and that the similarity between $C$ and $S_2$ is also at
least $\sigma$. In the vector space model, these similarities
are obtained by taking the cosine of the angle between
the vectors associated with each document. Let $\alpha_1$ be
the angle between $C$ and $S_1$, and let $\alpha_2$ be the angle
between $C$ and $S_2$. We then have that $\cos\alpha_1 \ge \sigma$ and
$\cos\alpha_2 \ge \sigma$. Note that the angle between $S_1$ and $S_2$ can
be at most $\alpha_1 + \alpha_2$, and therefore we have the following
lower bound on the similarity between satellite vertices
in a star-shaped subgraph of $G_\sigma$.

Proposition 2.1. Let $G_\sigma$ be a similarity graph
and let $S_1$ and $S_2$ be two satellites in the same star
in $G_\sigma$. Then the similarity between $S_1$ and $S_2$ must be
at least

\[ \cos(\alpha_1 + \alpha_2) = \cos\alpha_1 \cos\alpha_2 - \sin\alpha_1 \sin\alpha_2. \]

If $\sigma = 0.7$, $\cos\alpha_1 = 0.75$ and $\cos\alpha_2 = 0.85$, for
instance, we can conclude that the similarity between
the two satellite vertices must be at least⁴

\[ (0.75) \cdot (0.85) - \sqrt{1 - (0.75)^2}\,\sqrt{1 - (0.85)^2} \approx 0.29. \]

Note  that  while  this  may  not  seem  very  encouraging, 
the  above  analysis  is  based  on  absolute  worst-case 
assumptions,  and  in  practice,  the  similarities  between 
satellite  vertices  are  much  higher.  We  further  undertook 
a  study  to  determine  the  expected  similarity  between 
two  satellite  vertices. 
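
The worst-case bound of Proposition 2.1 is easy to evaluate numerically. The short sketch below recomputes the example from the two center-to-satellite similarities; the function name `satellite_similarity_lower_bound` is introduced here only for illustration.

```python
import math


def satellite_similarity_lower_bound(cos1, cos2):
    """cos(a1 + a2), given cos(a1) and cos(a2) for angles in [0, pi/2]."""
    sin1 = math.sqrt(1.0 - cos1 ** 2)
    sin2 = math.sqrt(1.0 - cos2 ** 2)
    return cos1 * cos2 - sin1 * sin2


print(round(satellite_similarity_lower_bound(0.75, 0.85), 2))  # prints 0.29
```

When both satellites sit exactly at the threshold 0.7, the same function gives 0.49 - 0.51 = -0.02, which shows how pessimistic the worst case is.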

2.2  The  random  graph  model 
The  model  we  use  for  analysis  is  the  random  graph 
model [Bol95]. A random graph $G_{n,p}$ is an undirected
graph  with  n  vertices,  where  each  of  its  possible  edges  is 
inserted  randomly  and  independently  with  probability 
p.  Our  problem  fits  the  random  graph  model  if  we  make 
the  mathematical  assumption  that  “similar”  documents 
are  essentially  “random  perturbations”  of  one  another  in 
the  vector  space  model.  This  assumption  is  equivalent 
to  viewing  the  similarity  between  two  related  documents 
as  a  random  variable.  By  thresholding  the  edges  of 
the  similarity  graph  at  a  fixed  value,  for  each  edge 
of  the  graph  there  is  a  random  chance  (depending 
on  whether  the  value  of  the  corresponding  random 
variable  is  above  or  below  the  threshold  value)  that 
the  edge  remains  in  the  graph.  This  thresholded 
similarity  graph  is  thus  a  random  graph.  While  random 
graphs  do  not  perfectly  model  the  thresholded  similarity 
graphs  obtained  from  actual  document  corpora  (the 
actual  similarity  graphs  must  satisfy  various  geometric 
constraints  and  will  be  aggregates  of  many  “sets”  of 
“similar”  documents),  random  graphs  are  easier  to 
analyze,  and  our  experiments  provide  evidence  that 
results  obtained  for  random  graphs  closely  match  those 
obtained  for  thresholded  similarity  graphs  obtained 
from  actual  document  corpora.  As  such,  we  will  use  the 
random  graph  model  for  analysis  and  for  experimental 
verification  of  the  algorithms  presented  in  this  paper  (in 
addition  to  experiments  on  actual  corpora). 
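
For experimentation, a random graph in this model can be generated with a few lines of code. The sketch below is a plain quadratic-time sampler; the function name `random_graph`, the integer vertex labels, and the optional `seed` parameter are assumptions of this example rather than anything specified in the paper.

```python
import random


def random_graph(n, p, seed=None):
    """Sample G(n, p): each of the n(n-1)/2 possible edges is included
    independently with probability p. Returns adjacency sets."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj
```

The returned adjacency sets have the same shape as a thresholded similarity graph, so the same clustering code can be run on either.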

The following upper bound on the expected similarity
between two satellite vertices holds:

Proposition 2.2. The expected similarity between
two satellite vertices $S_1$ and $S_2$ in the same star
in a similarity graph $G_\sigma$ is:

\[ \cos\alpha_1 \cos\alpha_2 + \frac{\sigma}{1 + \sigma} \sin\alpha_1 \sin\alpha_2. \]

Proof. (Omitted for space considerations.)

Note that for the previous example, the above
formula would predict a similarity between satellite
vertices of approximately 0.78. We have tested this
formula against real data, and the results of the test
with the MEDLINE data set⁵ are shown in Figure 1.
In this plot, the x- and y-axes are similarities between
cluster centers and satellite vertices, and the z-axis
is the actual mean squared prediction error (MSE) of
the above formula for the similarity between satellite
vertices. Note that the maximum root mean squared
error (RMS) is quite small (approximately 0.13 in the
worst case), and for reasonably high similarities, the
error is negligible. From our tests with real data, we
have concluded that the random graph model holds
and that this formula is quite accurate. We can
conclude that star-shaped subgraphs are reasonably
“dense” in the sense that they imply relatively high
pairwise similarities between documents.

4 Note that $\sin\theta = \sqrt{1 - \cos^2\theta}$.
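
The expected-similarity formula of Proposition 2.2 can be checked against the running example in the same way. The function name `expected_satellite_similarity` is hypothetical and used only for this sketch.

```python
import math


def expected_satellite_similarity(cos1, cos2, sigma):
    """Evaluate cos(a1) cos(a2) + sigma / (1 + sigma) * sin(a1) sin(a2)."""
    sin1 = math.sqrt(1.0 - cos1 ** 2)
    sin2 = math.sqrt(1.0 - cos2 ** 2)
    return cos1 * cos2 + sigma / (1.0 + sigma) * sin1 * sin2


print(round(expected_satellite_similarity(0.75, 0.85, 0.7), 2))  # prints 0.78
```

For the example values the expected similarity (about 0.78) lies far above the worst-case bound (about 0.29), which is the gap the text refers to.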

[Plot omitted; title: medline; z-axis: mean square error]


Figure  1:  This  figure  shows  the  error  for  a  6000  abstract 
subset  of  MEDLINE. 

3  The  off-line  star  algorithm 

Motivated by the discussion of the previous section, we
now present the star algorithm which can be used to
organize documents in an information system. The star
algorithm is based on a greedy cover of the thresholded
similarity graph by star-shaped subgraphs; the algorithm
itself is summarized in Figure 2 below.

5 MEDLINE is a large collection of medical abstracts that is
often used as a benchmark in information retrieval experiments.


For any threshold $\sigma$:

1. Let $G_\sigma = (V, E_\sigma)$ where $E_\sigma = \{e : w(e) \ge \sigma\}$.

2. Let each vertex in $G_\sigma$ initially be unmarked.

3. Calculate the degree of each vertex $v \in V$.

4. Let the highest degree unmarked vertex be a
star center, and construct a cluster from the star
center and its associated satellite vertices. Mark
each node in the newly constructed cluster.

5. Repeat step 4 until all nodes are marked.

6. Represent each cluster by the document corresponding
to its associated star center.

Figure 2: The star algorithm


Theorem 3.1. The running time of the off-line
star algorithm on a similarity graph $G_\sigma$ is $\Theta(E + E_\sigma)$.

Proof. The following implementation of this algorithm
has a running time linear in the size of the graph.
Each vertex v has a data structure associated with it that
contains v.degree, the degree of the vertex, v.adj, the
list of adjacent vertices, v.marked, which is a bit denoting
whether the vertex belongs to a star or not, and
v.center, which is a bit denoting whether the vertex
is a star center. (Computing v.degree for each vertex
can easily be performed in $\Theta(E + E_\sigma)$ time.) The
implementation starts by sorting the vertices in V by degree
($\Theta(E)$ time since degrees are integers in the range
$\{0, |E|\}$). The program then scans the sorted vertices
from the highest degree to the lowest as a greedy search
for star centers. Only vertices that do not belong to a
star already (that is, they are unmarked) can become
star centers. Upon selecting a new star center v, its
v.center and v.marked bits are set and for all $w \in$ v.adj,
w.marked is set. Only one scan of V is needed to determine
all the star centers. Upon termination, the star
centers and only the star centers have the center field
set. We call the set of star centers the star cover of the
graph. Each star is fully determined by the star center,
as the satellites are contained in the adjacency list of
the center vertex.
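
The greedy scan described in the proof can be sketched as follows. This is an illustrative reading, not the authors' code: the function name `star_cover`, the dictionary-of-sets graph representation, and the bucket sort by degree are assumptions of this sketch, and the graph is assumed to be simple (no self-loops), so that every degree is at most n - 1.

```python
def star_cover(adj):
    """Greedy star cover of a thresholded similarity graph.

    adj: dict mapping each vertex to the set of its neighbours.
    Returns a dict mapping each star center to its set of satellites.
    """
    n = len(adj)
    # Bucket sort by degree: degrees are integers in the range 0..n-1.
    buckets = [[] for _ in range(n)]
    for v, nbrs in adj.items():
        buckets[len(nbrs)].append(v)

    marked = set()   # vertices that already belong to some star
    stars = {}
    for degree in range(n - 1, -1, -1):      # highest degree first
        for v in buckets[degree]:
            if v not in marked:
                stars[v] = set(adj[v])       # satellites = adjacency list
                marked.add(v)
                marked.update(adj[v])
    return stars
```

On the path a-b-c with an additional isolated vertex d, the scan picks b first (degree 2), which marks a and c, and later picks d, giving the two clusters {b, a, c} and {d}. Ties between vertices of equal degree are broken by their order in the input, so other covers are possible.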

This  algorithm  has  two  features  of  interest.  The 
first  feature  is  that  the  star  cover  is  not  unique.  A 
similarity  graph  may  have  several  different  star  covers 
because  when  there  are  several  vertices  of  the  same 
highest  degree,  the  algorithm  arbitrarily  chooses  one 
of  them  as  a  star  center  (whichever  shows  up  first 
in  the  sorted  list  of  vertices).  The  second  feature  of 
this  algorithm  is  that  it  provides  a  simple  encoding 
of  a  star  cover  by  assigning  the  types  “center”  and 
“satellite”  (which  is  the  same  as  “not  center”  in  our 
implementation)  to  vertices.  We  define  a  correct  star 
cover  as  a  star  cover  that  assigns  the  types  “center” 
and  “satellite”  in  such  a  way  that  (1)  a  star  center  is  not 
adjacent  to  any  other  star  center  and  (2)  every  satellite 
vertex  is  adjacent  to  at  least  one  center  vertex  of  higher 
degree.  It  immediately  follows  that: 

Theorem 3.2. The off-line star algorithm produces
a correct star cover.
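
The definition of a correct star cover translates directly into a small checker. The sketch below is only an illustration; the function name `is_correct_star_cover` is hypothetical, and it assumes that a center whose degree equals that of a satellite satisfies condition (2), since the greedy algorithm breaks ties between equal degrees arbitrarily.

```python
def is_correct_star_cover(adj, centers):
    """Check the two conditions of a correct star cover.

    adj:     dict mapping each vertex to the set of its neighbours.
    centers: iterable of the vertices typed "center"; all others are
             treated as satellites.
    """
    centers = set(centers)
    for v, nbrs in adj.items():
        if v in centers:
            # (1) a star center is not adjacent to any other star center
            if nbrs & centers:
                return False
        else:
            # (2) a satellite needs an adjacent center of at least its
            #     own degree (ties accepted: an assumption of this sketch)
            if not any(c in centers and len(adj[c]) >= len(nbrs)
                       for c in nbrs):
                return False
    return True
```

For the path a-b-c, the cover with the single center b passes the check, while choosing a and c as centers fails because b would be a satellite of higher degree than both of its centers.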

We  will  use  the  two  features  of  the  off-line  algorithm 
mentioned  above  in  the  analysis  of  the  on-line  version 
of  the  star  algorithm,  in  the  next  section. 

4  Clustering dynamic data with the star algorithm

In this section we consider algorithms for computing
the organization of a dynamic information system. We
derive an on-line version of the star algorithm for
information organization that can incrementally compute
clusters of similar documents. We continue assuming
the vector space model and the cosine metric to capture
the pairwise similarity between the documents of
the corpus, and the random graph model for analyzing
the expected behavior of the new algorithm.

We assume that documents are inserted into or deleted
from the collection one at a time. For simplicity,
we  will  focus  our  discussion  on  adding  documents  to 
the  collection.  The  delete  algorithm  is  similar.  The 
intuition  behind  the  incremental  computation  of  the 
star  cover  of  a  graph  after  a  new  vertex  is  inserted  is 
depicted  in  Figure  3.  The  top  figure  denotes  a  similarity 
graph  and  a  correct  star  cover  for  this  graph.  Suppose 
a  new  vertex  is  inserted  in  the  graph,  as  in  the  middle 
figure.  The  original  star  cover  is  no  longer  correct  for 
the  new  graph.  The  bottom  figure  shows  the  correct 
star  cover  for  the  new  graph.  How  does  the  addition  of 
this  new  vertex  affect  the  correctness  of  the  star  cover? 
In  general,  the  answer  depends  on  the  degree  of  the  new 
vertex  and  on  its  adjacency  list.  If  the  adjacency  list  of 
the  new  vertex  does  not  contain  any  star  centers,  the 
new  vertex  can  be  added  in  the  star  cover  as  a  star 
center.  If  the  adjacency  list  of  the  new  vertex  contains 
any  center  vertex  c  whose  degree  is  higher,  the  new 
vertex  becomes  a  satellite  vertex  of  c.  The  difficult 
case  that  destroys  the  correctness  of  the  star  cover  is 
when  the  new  vertex  is  adjacent  to  a  collection  of  star 
centers,  each  of  whose  degree  is  lower  than  that  of  the 
new  vertex.  In  this  situation,  the  star  structure  already 
in  place  has  to  be  modified  to  assign  the  new  vertex  as 
a  star  center.  The  satellite  vertices  in  the  stars  that  are 
broken  as  a  result  have  to  be  re-evaluated. 

4.1  The  on-line  star  algorithm 

Motivated by the intuition in the previous section, we
now describe an on-line algorithm for incrementally
computing star covers of dynamic graphs. The algorithm
is shown in Figure 4. This algorithm uses a data
structure to efficiently maintain the star covers of an
undirected graph G = (V, E). For each vertex v ∈ V,
we maintain the following data.


Figure  3:  This  figure  shows  the  star  cover  change  after 
the  insertion  of  a  new  vertex.  The  larger-radius  disks 
denote  star  centers,  the  other  disks  denote  satellite 
vertices.  The  star  edges  are  denoted  by  solid  lines.  The 
inter-satellite  edges  are  denoted  by  dotted  lines.  The 
top  figure  shows  an  initial  graph  and  its  star  cover.  The 
middle  figure  shows  the  graph  after  the  insertion  of  a 
new  document.  The  bottom  figure  shows  the  star  cover 
of  the  new  graph. 


v.type      satellite or center
v.degree    degree of v
v.adj       list of adjacent vertices
v.centers   list of adjacent centers
v.inQ       flag specifying whether v is being processed

Note that while v.type can be inferred from v.centers
and v.degree can be inferred from v.adj, it will be
convenient to have all five pieces of data in the algorithm.
Let α be a vertex to be added to G, and let L be the
list of vertices in G which are adjacent to α. The
algorithm in Figure 4 will appropriately update the star
cover of G.
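As an illustration, the per-vertex record might be written as a small Python class. The class name, field defaults and the helper below are assumptions for this sketch, and sets stand in for the lists named above. The helper spells out one reading of the redundancy noted in the text: once the queue is empty, a vertex is a center exactly when it has no adjacent centers, and its degree is the length of its adjacency list.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality and hashing, so vertices can be set members
class Vertex:
    name: str
    type: str = "satellite"                                   # "satellite" or "center"
    degree: int = 0                                           # degree of the vertex
    adj: set = field(default_factory=set, repr=False)         # adjacent vertices
    centers: set = field(default_factory=set, repr=False)     # adjacent centers
    in_q: bool = False                                        # True while queued for processing


def redundant_fields_consistent(v):
    """Check the two inferable fields against the data they are inferred from."""
    degree_ok = v.degree == len(v.adj)
    type_ok = (v.type == "center") == (len(v.centers) == 0)
    return degree_ok and type_ok
```

Keeping the redundant fields trades a little memory for constant-time access to the type and degree during updates.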

The algorithm maintains a priority queue Q of
vertices not yet correctly placed in the star cover. When
a star is broken, its center and satellites are placed in Q.

The  on-line  star  cover  algorithm  is  more  complex 
than  its  off-line  counterpart.  We  devote  the  rest  of  this 
section  to  proving  that  the  algorithm  is  correct  and  to 
analyzing  its  expected  running  time. 
UPDATE(α, L)
 1  α.type ← satellite
 2  α.degree ← 0
 3  α.adj ← ∅
 4  α.centers ← ∅
 5  forall β in L
 6      α.degree ← α.degree + 1
 7      β.degree ← β.degree + 1
 8      INSERT(β, α.adj)
 9      INSERT(α, β.adj)
10      if (β.type = center)
11          INSERT(β, α.centers)
12      else
13          β.inQ ← true
14          ENQUEUE(β, Q)
15      endif
16  endfor
17  α.inQ ← true
18  ENQUEUE(α, Q)
19  while (Q ≠ ∅)
20      φ ← EXTRACTMAX(Q)
21      if (φ.centers = ∅)
22          φ.type ← center
23          forall β in φ.adj
24              INSERT(φ, β.centers)
25          endfor
26      else
27          if (∀δ ∈ φ.centers, δ.degree < φ.degree)
28              φ.type ← center
29              forall β in φ.adj
30                  INSERT(φ, β.centers)
31              endfor
32              forall δ in φ.centers
33                  δ.type ← satellite
34                  forall μ in δ.adj
35                      DELETE(δ, μ.centers)
36                      if (μ.inQ = false)
37                          μ.inQ ← true
38                          ENQUEUE(μ, Q)
39                      endif
40                  endfor
41              endfor
42          endif
43      endif
44      φ.inQ ← false
45  endwhile


Figure  4:  The  on-line  star  algorithm  for  clustering. 
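The procedure of Figure 4 translates fairly directly into Python. The following is a hedged sketch rather than the authors' code: the `Vertex` class, the use of `heapq` with negated degrees as a max-priority queue, and the use of sets instead of linked lists are all assumptions made for illustration. The two branches that promote a vertex to a center (statements 21-25 and 27-31) are merged, because `all(...)` over an empty set of centers is true.

```python
import heapq
from itertools import count


class Vertex:
    def __init__(self, name):
        # Statements 1-4 of Figure 4: a fresh vertex starts as a bare satellite.
        self.name = name
        self.type = "satellite"
        self.degree = 0
        self.adj = set()
        self.centers = set()
        self.in_q = False


def update(alpha, neighbors):
    """Insert `alpha`, adjacent to the vertices in `neighbors`, and repair the cover."""
    queue = []          # heap of (-degree, tie_breaker, vertex): a max-priority queue
    tie = count()       # unique tie-breaker so that vertices are never compared

    def enqueue(v):
        v.in_q = True
        heapq.heappush(queue, (-v.degree, next(tie), v))

    for beta in neighbors:                        # statements 5-16
        alpha.degree += 1
        beta.degree += 1
        alpha.adj.add(beta)
        beta.adj.add(alpha)
        if beta.type == "center":
            alpha.centers.add(beta)
        else:
            enqueue(beta)
    enqueue(alpha)                                # statements 17-18

    while queue:                                  # statements 19-45
        _, _, phi = heapq.heappop(queue)
        # True when phi has no adjacent centers, or all of them have lower degree.
        if all(delta.degree < phi.degree for delta in phi.centers):
            phi.type = "center"
            for beta in phi.adj:
                beta.centers.add(phi)
            # Break the stars of the lower-degree centers (statements 32-41).
            # Iterate over a copy: phi.centers shrinks inside the loop.
            for delta in list(phi.centers):
                delta.type = "satellite"
                for mu in delta.adj:
                    mu.centers.discard(delta)
                    if not mu.in_q:
                        enqueue(mu)
        phi.in_q = False                          # statement 44
```

Building the path a-b-c-d one vertex at a time with `update` leaves b and d as centers and a and c as satellites, which is one of the covers the off-line algorithm can produce on the same graph. Degrees only change before the loop starts, so the priorities stored in the heap stay valid for the whole update.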


4.2  Correctness 

In this section we show that the on-line algorithm is
correct by proving that it produces the same star cover
as the off-line algorithm, when the off-line algorithm
is run on the final graph considered by the on-line
algorithm. Before we state the result, we note that the
off-line star algorithm does not produce a unique cover.
When there are several vertices of the same highest
degree, the algorithm arbitrarily chooses one of them
as the next star center. We will show that the cover
produced by the on-line star algorithm is the same as
one of the covers that can be produced by the off-line
algorithm.

THEOREM 4.1. The cover generated by the on-line
star algorithm when $G_\sigma = (V, E_\sigma)$ is constructed
incrementally (by inserting its vertices one at a time) is
identical to some legal cover generated by the off-line
star algorithm on $G_\sigma$.

Proof. We can view a star cover of $G_\sigma$ as a correct
assignment of types (that is, “center” or “satellite”) to
the vertices of $G_\sigma$. The off-line star algorithm assigns
correct types to the vertices of $G_\sigma$. We will prove the
correctness of the on-line star cover by induction. The
induction invariant is that at all times, the types of
vertices not in Q are correct, assuming that the true
type of vertices in Q is “satellite.” This would imply
that when Q is empty, all vertices are assigned a correct
type, and thus the star cover is correct.

The invariant is true initially: the type of the new
node α is unknown and α is in Q; the types of all the
satellite neighbors of α are unknown and these neighbors
are in Q; and all the other vertices have correct types
from the original cover, assuming that the nodes in
the queue are correctly satellite. We now show that
the induction invariant is maintained throughout the
algorithm. Consider Figure 4. The first thing to note
is that the type of all the vertices in Q is “satellite”;
statements 14, 18 and 38 enqueue satellite vertices. We
now argue that every time a vertex φ of highest degree
is pulled out of Q, it is assigned a correct type. When
φ has no centers on its adjacency list, its type should
be “center” (which is assigned correctly by statement
22). When φ is adjacent to star centers $\delta_i$, each of
which has a lower degree than φ, the correct type for φ
is “center” (statement 28). This action has a side effect:
all $\delta_i$ cease to be star centers and thus get enqueued for
further evaluation (statements 32-39). Otherwise, the
correct type for φ is the default “satellite”. Since φ was
extracted from Q and all vertices in Q are satellites, the
type of φ is correct in this case as well.

To  complete  the  argument,  what  remains  to  be 
shown  is  that  eventually  the  queue  Q  becomes  empty. 
The termination of the while loop at statement 19 in
Figure 4 is guaranteed by the following result.

LEMMA 4.1. The degree of the stars broken by the
on-line star algorithm is strictly decreasing.

The lemma is equivalent to the following statement:
a node φ in Q has the potential of becoming a star center
and has the capability of adding new nodes γ to Q that
can become stars of degree strictly less than the degree
of node φ.

Suppose φ becomes a new star center. We show
that its satellite neighbors $\gamma_i$ cannot become star
centers. Two cases arise. (Case 1) $\gamma_i$ is not a star center
because its degree is smaller than the degree of the new
star center that covers φ in the new cover. (Case 2) $\gamma_i$
is not a star center because it is a satellite of a much
larger star, so its degree is larger than the degree of the
new star that covers φ. But this condition still holds
after making the new star. This completes the proof
sketch for the termination lemma and it follows that
the types assigned by the on-line algorithm are correct;
in other words, that there exists an off-line algorithm
that produces the same cover.

4.3  Running  Time  Analysis  and  Experimental 
Results 

In this section, we argue that the running time of the
on-line star algorithm is efficient in practice, asymptotically
matching the running time of the off-line star algorithm
($\Theta(V + E)$) to within lower order factors. We first
note,  however,  that  there  exist  worst-case  thresholded 
similarity graphs $G_\sigma$ and corresponding vertex insertion
sequences  which  cause  the  on-line  star  algorithm  to  run 
in $\Omega(V^2)$ time.6 These graphs and insertion sequences
rarely  arise  in  practice  though.  An  analysis  more  closely 
modeling  practice  is  the  random  graph  model  in  which 
$G_\sigma$ is a random graph and the insertion sequence is
random.  In  this  model,  the  expected  running  time  of 
the  on-line  star  algorithm  can  be  determined. 

In  the  sections  that  follow,  we  first  give  intuition  for 
the  expected  running  time  of  the  on-line  star  algorithm. 
In  subsequent  sections,  we  give  experimental  results 
showing  that  the  on-line  star  algorithm  is  quite  efficient 
with  respect  to  both  random  data  and  a  large  collection 
of  real  documents. 

4.3.1  Intuition 

We have implemented the on-line star algorithm using
a heap for the priority queue and simple linked lists for
the various lists required. The time required to insert
a new vertex and associated edges into a thresholded
similarity graph and to appropriately update the star
cover is largely governed by the number of stars that are
broken during the update, since breaking stars requires
inserting new elements into the priority queue. In
practice, very few stars are broken during any given
update. This is due partly to the fact that relatively few
stars exist at any given time (as compared to the number
of vertices or edges in the thresholded similarity graph)
and partly to the fact that the likelihood of breaking any
individual star is also small. We begin with the former,
noting that the number of stars expected to cover a
random graph $G_{n,p}$ is only $O(\log n)$.

6 An example is a graph consisting of two connected vertices A
and B of very high but identical degree (not both of which can be
star centers) and an insertion sequence which causes the “local”
center to repeatedly switch between A and B.

Theorem 4.2. The expected size of the star cover
for $G_{n,p}$ is $O\left(\frac{\log n}{\log\left(\frac{1}{1-p}\right)}\right)$.


Proof. The star cover algorithm is greedy: it iterates
by selecting the highest degree vertex not yet covered
as a star center and marking this node and all its
adjacent vertices as covered. Each iteration creates a
new star. We will argue that the number of iterations
is $O\left(\frac{\log n}{\log\left(\frac{1}{1-p}\right)}\right)$. The argument relies on the random graph
model described in Section 2.1.

Let $G_{n,p}$ correspond to a random graph on n
vertices where each edge exists with probability p. The
degree of each vertex of $G_{n,p}$ is distributed binomially:
$\Pr[\deg = k] = \mathrm{bin}(k; n-1, p) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$.
The mean of this distribution is $\mu = (n-1)p$
and its standard deviation is $\sigma = \sqrt{(n-1)p(1-p)}$. Note
that while the degrees of the vertices do exhibit
some dependence, for practical purposes they can be
considered independent [Bol95]. This means that on
average, each star covers $(n-1)p + 1 \geq np$ vertices.7
Since the np vertices covered by each star are randomly
chosen, there will be some overlap between the star
covers. Each new star leaves uncovered a $(1-p)$
fraction of the previously uncovered vertices. In other
words, after the first iteration, $(1-p)n$ vertices remain
uncovered. After i iterations, $(1-p)^i n$ vertices remain
uncovered. The algorithm terminates when all the
vertices are covered, or $(1-p)^i n < 1$. By taking logs of
both sides of this inequality, it follows that
$i > \frac{\log n}{\log\left(\frac{1}{1-p}\right)}$ is sufficient.
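To make the bound in the proof concrete, the short Python sketch below counts how many times the uncovered population must be multiplied by (1 - p) before it drops below one vertex, and compares that count with the closed form log n / log(1/(1-p)). The function names are illustrative, and the calculation only mirrors the heuristic argument above; it does not simulate the star algorithm itself.

```python
import math


def closed_form_bound(n, p):
    """The threshold log n / log(1 / (1 - p)) from the proof."""
    return math.log(n) / math.log(1.0 / (1.0 - p))


def iterations_until_covered(n, p):
    """Smallest i such that (1 - p)**i * n < 1, found by direct iteration."""
    uncovered = float(n)
    i = 0
    while uncovered >= 1.0:
        uncovered *= 1.0 - p   # each new star leaves a (1 - p) fraction uncovered
        i += 1
    return i
```

For n = 1000 and p = 0.2 (the parameters used for the random-graph experiments later in this section), the closed form evaluates to roughly 30.96, so 31 iterations suffice under this model.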


Thus, the number of stars is expected to be relatively
small. Furthermore, the probability any individual
star will be broken is quite small as well. A star
can only be broken if the star center has the same degree
as one of its associated satellite vertices and if the
vertex being added to the graph is connected to that
satellite but not to the star center.8 In practice, the
expected number of stars broken during an update is a
small constant even for graphs containing thousands of
vertices (though asymptotically it is certainly a slowly
growing function of n). In Figure 5, we give experimental
results showing that the total number of stars broken
during runs on two different types of data is roughly a
linear function of the number of vertices; thus, the
expected number of stars broken during any given update
is roughly a constant (or more likely a slowly growing
function of n).

7 The star covers its center and $(n-1)p$ satellites.

8 Once a star is broken during an update, however, other stars
can be broken in different ways via a cascading effect.


Figure 5: The dependence of the number of broken
stars on the number of vertices in a random graph (left)
and for text data (right).

The time to break a star is roughly proportional
to its size (the degree of its associated star center), and
since the degrees of all vertices are expected to be similar
in distribution ($\mathrm{bin}(k; n-1, p)$), this is on the order
of the number of edges being inserted into the graph.
Since only a constant number of stars are expected to
be broken, the expected time to perform an update will
be roughly proportional to the number of edges inserted
in the graph during the update. Thus, the total time
to perform n updates should be roughly proportional
to the total number of edges in the final graph. In the
sections that follow, we give experimental results which
confirm this fact.

4.3.2  Experimental  results 

We have experimented with the on-line clustering
algorithm in two scenarios. The first type of data matches
our random graph model and consists of random
similarity graphs. While this type of data is useful as a
benchmark for the running time of the algorithm, it
does not satisfy the geometric constraints of the vector
space model. We also conducted experiments using
real data from the TREC collection9 as a second type
of benchmark for the algorithm.

We now detail our data generation procedure and
the experimental running time of the on-line star
algorithm on each data type.


9TREC  is  the  annual  text  retrieval  conference.  TREC  is 
organized  as  a  competition.  Each  participant  is  given  on  the  order 
of  5  gigabytes  of  data  and  a  standard  set  of  queries  on  which  to 
test  their  systems.  The  results  and  the  system  descriptions  are 
presented  as  papers  at  the  TREC  conference. 


Generating Random Data. We ran the on-line
star cover algorithm on a random graph with 1000
nodes. The edges in this graph were inserted randomly
with probability p = 0.2. The on-line algorithm was
run 30 times. Each time, the vertices of the random
graph were inserted in random order. The results were
averaged over the 30 experiments. Figure 6 shows the
data from these experiments. Note that the running
time is roughly linear in the number of edges in the
graph, and we can see the effects of lower order terms.
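A data-generation procedure of this kind can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' experimental harness: the function names, the seeding scheme and the dictionary-of-sets representation are all hypothetical. The second function yields, for each vertex in a random order, the list L of neighbors that are already in the graph, which is the input the on-line update expects.

```python
import random


def random_graph(n, p, seed=None):
    """Undirected random graph on vertices 0..n-1; each edge exists with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def random_insertion_sequence(adj, seed=None):
    """Yield (vertex, already-inserted neighbors) pairs in a random vertex order."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    inserted = set()
    for v in order:
        yield v, sorted(adj[v] & inserted)
        inserted.add(v)
```

Each edge appears exactly once in the sequence, at the moment its second endpoint is inserted, so the total length of the neighbor lists equals the number of edges in the final graph.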


Figure  6:  This  figure  shows  the  dependence  of  the 
running  time  of  the  on-line  star  algorithm  on  the 
number  of  edges  in  a  random  graph  (left)  and  for  text 
data  (right). 

Experiments with real data. We ran the on-line
star cover algorithm on a document collection
that consists of a slice of TREC documents augmented
with our department’s technical reports. The resulting
collection consists of 2224 documents. We ran four
experiments. Each time we set a different threshold
and added the similarity graph nodes in random order.
The results of these experiments were averaged and the
running time measurements appear to be linear in the
number of edges of the similarity graph. Figure 6 shows
the data from these experiments. Note that the
running time is roughly linear in the number of edges
in the graph, and we can see the effects of lower order
terms.

Comparison between Star, Single Link, and
Average Link on TREC Data. We have implemented
a system for organizing information that uses
the star algorithm. Figure 7 shows the user interface
to this system [APR97]. In order to evaluate
the performance of our system, we tested the star
algorithm against two widely used clustering algorithms
in IR: the single link method [Rij79] and the average
link method [Voo85]. We used data from the TREC-6
conference as our testing medium. The TREC collection
contains a very large set of documents of which
21,694 have been ascribed relevance data with respect
to 47 topics. These 21,694 documents were partitioned
into 22 separate subcollections of approximately 1,000
documents each for 22 rounds of the following experiment.
For each of the 47 topics the given collection of
documents was clustered with each of the three algorithms
and the best cluster was returned. To measure
the quality of a cluster, we use the common E measure
[Rij79] defined as: $E(p,r) = 1 - \frac{2}{1/p + 1/r}$,
where p and r are the standard precision and recall of
the cluster with respect to the set of documents relevant
to the topic.10 It is worthwhile to note that
in viewing data comparing two clustering methods,
lower E(p,r) values correspond to better performance.
Averaging over all 22 experiments, we find that the
mean (p, r, E(p,r)) values for star, average-link and
single-link are (0.77, 0.54, 0.36), (0.83, 0.44, 0.42) and
(0.84, 0.41, 0.45), respectively. Thus, the star algorithm
represents a 16.6% improvement in performance with
respect to average-link and a 25% improvement with
respect to single-link.
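The E measure and the improvement percentages quoted above are easy to reproduce. The sketch below is illustrative; in particular, the definition of "improvement" as the difference in E divided by the star algorithm's E is an inference from the reported figures (0.36, 0.42, 0.45 giving 16.6% and 25%), not something the text states explicitly.

```python
def e_measure(p, r):
    """E(p, r) = 1 - 2 / (1/p + 1/r); lower values are better."""
    return 1.0 - 2.0 / (1.0 / p + 1.0 / r)


def relative_improvement(e_star, e_other):
    """Assumed definition: reduction in E relative to the star algorithm's E."""
    return (e_other - e_star) / e_star
```

With the reported mean E values, `relative_improvement(0.36, 0.42)` is about 0.167 and `relative_improvement(0.36, 0.45)` is 0.25. Applying `e_measure` to the mean precision and recall need not reproduce the mean E value exactly, since the average of E over experiments is generally not the E of the averages.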

Figure 8 shows the detailed E measures for the star
algorithm vs. the single link algorithm and for the star
algorithm vs. the average link algorithm. We collected
all the topic clusters computed in this experiment.
We sorted the clusters produced by the star algorithm
according to their E values. We plotted the E value for
the corresponding cluster computed by the single link
algorithm (see the oscillating line in Figure 8-left) and
for the average link algorithm (see the oscillating line
in Figure 8-right). We note that the star algorithm
outperforms both the single link algorithm and the
average link algorithm, because the E values for the
star clusters are almost everywhere lower than the
corresponding values for the other two algorithms. Note
that not all topics are present in all 22 experiments,
which is why we have only approximately 500 clusters
in these graphs.

Our experiments show that in general, the star
algorithm outperforms single link by 25% and that it
outperforms average link by 16.6%. We repeated this
experiment on the same data, using one collection only
(of 21,694 documents), and obtained similar results.11
These improvements are significant for text applications.
Considering that the star algorithm outperforms
the average link algorithm, is easier to implement,
can be used as an on-line algorithm, and runs much
faster, these experiments provide support for using the
star algorithm in clustering and off-line information
organization.


10 Precision is the fraction of returned documents that are correct.
Recall is the fraction of correct documents that are returned.

11  The  precision,  recall,  and  E  values  for  star,  average  link,  and 
single  link  were  (.53,  .32,  .61),  (.63,  .25,  .64),  and  (.66,  .20,  .70), 
respectively.  We  note  that  the  E  measures  are  worse  for  all  three 
algorithms  on  this  larger  collection  and  that  the  star  algorithm 
outperforms  average  link  by  10.3%  and  single  link  by  20.7%. 
Figure 7: This is a screen snapshot from a clustering
experiment. The top window is the query window. The
middle window consists of a ranked list of documents
that were retrieved in response to the user query. The
user may select “get” to fetch a document or “graph”
to  request  a  graphical  visualization  of  the  clusters  as 
in  the  bottom  window.  The  left  graph  displays  all 
the  documents  as  dots  around  a  circle.  Clusters  are 
separated  by  gaps.  The  edges  denote  pairs  of  documents 
whose  similarity  falls  between  the  slider  parameters. 
The  right  graph  displays  all  the  clusters  as  disks.  The 
radius  of  a  disk  is  proportional  to  the  size  of  the  cluster. 
The  distance  between  the  disks  is  proportional  to  the 
similarity  distance  between  the  clusters. 


5  Discussion 

We presented and analyzed an off-line clustering
algorithm for static information organization and an on-line
clustering algorithm for dynamic information organization.
We discussed the random graph model for analyzing
these algorithms and showed that in this model, the
algorithms have expected running time that is roughly
linear in the number of edges. The data we gathered
from experimenting with these algorithms provides
support for the validity of our model and analysis. The
empirical tests show that both algorithms have an
asymptotic linear time performance in the number of edges in
the graph. In addition, both algorithms are simple and
easy to implement. We believe that the fast running
time and the ease of implementation make these
algorithms very practical candidates for use in automatically
organizing digital libraries.

Figure 8: This figure shows the E measure for the
star clustering algorithm vs. the single link clustering
algorithm (left) and the star algorithm vs. the average
link algorithm (right). The y axis shows the E measure.
The x axis shows the cluster number. Clusters have
been sorted according to the E value for the star
algorithm.

This work departs from previous clustering
algorithms used in information retrieval that use a fixed
number of clusters for partitioning the space. Since
the number of clusters produced by our algorithms is
given by the underlying topic structure in the information
system, our clusters are dense and accurate. Our
work extends previous results [HP96] that support using
clustering for browsing applications and presents positive
evidence for the cluster hypothesis. In [APR97], we
argue that by using a clustering algorithm that guarantees
the cluster quality through separation of dissimilar
documents and aggregation of similar documents,
clustering is beneficial for information retrieval tasks that
require high precision and high recall. Precision and recall
are the standard measurements for the performance of
an information retrieval algorithm [Sal89].
Abstract

We  present  three  scalable  extensions  of  the  star  algorithm  for  information  organization  that  use 
sampling.  The  star  algorithm  organizes  a  document  collection  into  clusters  that  are  naturally  induced 
by the topic structure of the collection, via a computationally efficient cover by dense subgraphs. We
also  provide  supporting  data  from  extensive  experiments. 

1  Introduction 

Our  goal  is  to  develop  a  completely  automated  information  organization  system  for  digital  libraries, 
automated  tools  for  librarians  to  classify  this  information,  automatic  tools  to  create  reference  pointers 
into  such  collections,  and  automated  tools  that  allow  users  to  locate  information  effectively. 

We  focus  on  static  and  dynamic  digital  collections  of  unstructured  text.  We  consider  the  problem  of 
determining  the  topic  structure  of  text  data,  without  a  priori  knowledge  of  the  number  of  topics  in  the 
data  or  any  other  information  about  their  composition.  We  assume  that  the  collections  may  be  static 
(for  example,  digital  legacy  collections)  or  dynamic  (for  example,  news  wires).  We  look  to  discover 
hierarchies  of  topics  and  subtopics  in  such  text  collections.  Thus,  we  develop  clustering  algorithms  that 
can  be  used  in  off-line,  on-line,  and  hierarchical  mode.  We  wish  for  these  algorithms  to  be  fast,  scalable, 
accurate,  and  to  discover  the  naturally  occurring  topics  in  the  collection.  In  our  previous  work  (Aslam  et 
al,  1998;  Aslam  et  al,  1999),  we  proposed  an  off-line  and  an  on-line  approach  based  on  graph  theory.  Our 
algorithms,  called  the  star  clustering  algorithms,  compute  clusters  induced  by  the  natural  topic  structure 
of  the  space.  Thus,  this  work  is  different  than  previous  work  in  using  clustering  to  organize  information 
(Cutting  et  al,  1993;  Charikar  et  al,  1997)  in  that  we  do  not  impose  the  constraint  to  use  a  fixed  number 
of  clusters.  This  previous  work  argues  that  the  star  algorithm  is  simple,  efficient,  can  be  used  in  off-line 
as  well  as  on-line  mode,  and  it  outperforms  existing  clustering  algorithms  such  as  single  link,  average 
link,  and  k-means.  In  this  paper  we  consider  scalability  issues  in  developing  an  information  organization 
system.  We  present  three  different  scalable  extensions  to  the  star  algorithm  and  show  data  from  extensive 
experiments. 


2  Related  Work 


There has been extensive research on clustering and applications to many domains (Everitt, 1993; Mirkin
1996; Silverstein and Pedersen 1997; Sibson, 1973; Worona, 1971). For a good overview see (Jain and
Dubes, 1988). For a good overview of using clustering in Information Retrieval (IR) see (Willett, 1988).
The use of clustering in IR was mostly driven by the cluster hypothesis (Rijsbergen, 1979) which states
that relevant documents tend to be more closely related to each other than to non-relevant documents.
Efforts have been made to find whether the cluster hypothesis is valid. Voorhees (Voorhees, 1985)
discusses a way of evaluating whether the cluster hypothesis holds and shows negative results. Croft
(Croft, 1980) describes a method for bottom-up cluster search that could be shown to outperform a
full ranking system for the Cranfield collection. In (Jardine and van Rijsbergen, 1971) Jardine and van
Rijsbergen show some evidence that search results could be improved by clustering. Hearst and Pedersen
(Hearst and Pedersen, 1996) re-examine the cluster hypothesis by focusing on the Scatter/Gather system
(Cutting et al, 1993) and conclude that it holds for browsing tasks.

Systems  like  Scatter/Gather  (Cutting  et  al,  1993)  provide  a  mechanism  for  user-driven  organization  of 
data  in  a  fixed  number  of  clusters,  but  the  users  need  to  be  in  the  loop  and  the  computed  clusters  do 
not  have  accuracy  guarantees.  Scatter/Gather  uses  fractionation  to  compute  nearest-neighbor  clusters. 
Charikar et al. (Charikar et al, 1997) consider a dynamic clustering algorithm to partition a collection
of text documents into a fixed number of clusters. Since in dynamic information systems the number
of  topics  is  not  known  a  priori,  a  fixed  number  of  clusters  cannot  generate  a  natural  partition  of  the 
information. 


3  Background:  The  Star  Algorithm  for  Information  Organization 


For any threshold σ:

1. Let $G_\sigma = (V, E_\sigma)$ where $E_\sigma = \{e : w(e) \geq \sigma\}$.

2. Let each vertex in $G_\sigma$ initially be unmarked.

3. Calculate the degree of each vertex v ∈ V.

4.  Let  the  highest  degree  unmarked  vertex  be  a  star  center,  and  construct  a  cluster  from  the  star 
center  and  its  associated  satellite  vertices.  Mark  each  node  in  the  newly  constructed  cluster. 

5.  Repeat  step  4  until  all  nodes  are  marked. 

6.  Represent  each  cluster  by  the  document  corresponding  to  its  associated  star  center. 


Figure  1:  The  star  algorithm 
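The steps of Figure 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `star_cover`, the dense list-of-lists similarity matrix, and the inclusive threshold test are all assumptions made for the example.

```python
from typing import Dict, List, Set


def star_cover(sim: List[List[float]], sigma: float) -> Dict[int, Set[int]]:
    """Greedy star cover of the similarity graph thresholded at sigma.

    Returns a mapping from each star center to its cluster (the center plus
    its satellites). Clusters may overlap, because a satellite can be
    adjacent to more than one center.
    """
    n = len(sim)
    # Step 1: adjacency sets of the thresholded graph (no self loops).
    # Assumption: an edge is kept when the similarity is at least sigma.
    adj = [
        {j for j in range(n) if j != i and sim[i][j] >= sigma}
        for i in range(n)
    ]
    # Steps 2-3: every vertex starts unmarked; degrees are fixed, so
    # scanning in decreasing degree order always yields the highest-degree
    # unmarked vertex next.
    marked = [False] * n
    order = sorted(range(n), key=lambda v: len(adj[v]), reverse=True)
    clusters: Dict[int, Set[int]] = {}
    # Steps 4-5: promote unmarked vertices to star centers until all are marked.
    for v in order:
        if marked[v]:
            continue
        clusters[v] = {v} | adj[v]
        for u in clusters[v]:
            marked[u] = True
    # Step 6: each cluster is keyed (represented) by its star center.
    return clusters
```

With four documents where document 0 is similar to documents 1 and 2 and document 3 is unrelated to the rest, the sketch yields one star centered at 0 and a singleton star at 3.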

To  compute  accurate  topic  clusters,  one  possibility  is  to  formalize  clustering  as  covering  similarity  graphs 
by  cliques.  A  clique  cover  will  guarantee  that  its  documents  are  strongly  related  to  each  other.  Covering 
by  cliques  is  NP-complete,  and  thus  intractable  for  large  document  collections.  Unfortunately,  it  has  also 
been  shown  that  the  problem  cannot  even  be  approximated  in  polynomial  time  (Zuckerman,  1993).  We 
instead  propose  using  a  cover  by  dense  subgraphs  that  are  star-shaped  and  that  can  be  computed  off-line 
for  static  data  and  on-line  for  dynamic  data.  What  we  lose  in  intra-cluster  similarity  guarantees,  we  gain 
in  computational  efficiency. 

We  represent  the  document  collection  as  a  complete  similarity  graph,  where  the  vertices  correspond  to 
documents  and  the  edges  are  weighted  by  a  similarity  measure.  We  have  used  two  measures:  the  cosine 
metric  and  an  information-theoretic  metric. 
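As an illustration of the similarity graph, the sketch below computes cosine similarities between sparse term-weight vectors and fills the complete similarity matrix. The dictionary representation and the names `cosine` and `similarity_matrix` are assumptions for the example; the source does not specify how documents are encoded, and the information-theoretic measure is not shown.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # hypothetical encoding: term -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs: List[Vector]) -> List[List[float]]:
    """Complete, symmetric similarity matrix: O(n^2) space and time."""
    n = len(docs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = cosine(docs[i], docs[i])
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(docs[i], docs[j])
    return sim
```

Because every pair of documents is compared, both the storage and the number of dot products grow quadratically with the collection size.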

To compute accurate topic clusters, we create a thresholded similarity graph, where the thresholding
parameter is given by the smallest similarity we would like to have between any documents within a
topic. We then approximate a clique cover of this graph by covering the associated thresholded similarity
graph with star-shaped subgraphs. A star-shaped subgraph on m + 1 vertices consists of a single star
center and m satellite vertices, where there exist edges between the star center and each of the satellite
vertices. A greedy algorithm (see Figure 1) computes this cover for static collections. In (Aslam et
al, 1998; Aslam et al, 1999) we show an on-line version of this algorithm that supports information
organization in dynamic collections.

Star-graph covers are interesting because they provide accuracy guarantees on the computed topics. By
investigating the geometry of the problem, we can derive a lower bound on the similarity between satellite
vertices as well as provide a formula ($\cos\gamma \ge \cos\alpha_1 \cos\alpha_2 + \frac{\sigma}{1+\sigma} \sin\alpha_1 \sin\alpha_2$,
where $\alpha_1$ and $\alpha_2$ correspond to the similarity between the center and the two satellites
and $\sigma$ is the similarity threshold) for the expected similarity between satellite vertices using
the cosine metric. This formula predicts that the pairwise similarity between satellite vertices in a
star-shaped subgraph is high, a prediction supported by empirical evidence (Aslam et al, 1998).
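One lower bound that follows from geometry alone is the angular triangle inequality: if the center makes angles α1 and α2 with two satellites, the angle γ between the satellites is at most α1 + α2, so cos γ ≥ cos(α1 + α2) for non-negative document vectors. The source does not state which bound it derives, so treating this as the intended one is an assumption; the sketch below simply checks the inequality numerically. All names are hypothetical.

```python
import math
import random
from typing import List


def cos_sim(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def satellite_lower_bound(cos_a1: float, cos_a2: float) -> float:
    """cos(alpha1 + alpha2), given the two center-satellite similarities."""
    a1 = math.acos(max(-1.0, min(1.0, cos_a1)))
    a2 = math.acos(max(-1.0, min(1.0, cos_a2)))
    return math.cos(a1 + a2)


def check_bound(trials: int = 1000, dim: int = 5, seed: int = 0) -> bool:
    """Test the bound on random non-negative vectors (as term weights are)."""
    rng = random.Random(seed)
    for _ in range(trials):
        c, s1, s2 = (
            [rng.random() + 0.01 for _ in range(dim)] for _ in range(3)
        )
        bound = satellite_lower_bound(cos_sim(c, s1), cos_sim(c, s2))
        if cos_sim(s1, s2) < bound - 1e-9:
            return False
    return True
```

In the worst case, when both satellites sit exactly at the threshold σ, this bound reduces to 2σ² − 1, which is much weaker than the expected similarity discussed above.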


4  Scalable  Extensions  for  the  Star  Algorithm 


For any threshold σ:

1. Let D be a set of n documents sorted in random order in an array.

2. Let s be the sample size.

3. Compute a Star Cover for D[1..s] and let C be the list of star centers of this cover.

4. For each document D[i] in D[s+1..n]

• For each cluster C[j] in C: if similarity(D[i], C[j]) > σ insert D[i] in C[j]

• If D[i] was not inserted in any existing cluster, create a new cluster with D[i] as a center
and add this cluster to C.


Figure  2:  The  sampled  star  algorithm. 
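A minimal Python sketch of Figure 2 follows. It assumes documents are identified by integers 0..n-1 and that a caller-supplied function `sim(i, j)` returns their similarity; the function name `sampled_star`, the seed parameter, and the inclusive threshold test are assumptions for illustration.

```python
import random
from typing import Callable, Dict, Set


def sampled_star(
    n: int,
    sim: Callable[[int, int], float],
    sigma: float,
    s: int,
    seed: int = 0,
) -> Dict[int, Set[int]]:
    """Star-cluster a random sample of size s, then insert the remaining documents."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)  # step 1: documents in random order
    sample, rest = order[:s], order[s:]  # step 2: sample of size s

    # Step 3: off-line star cover restricted to the sample.
    adj = {
        i: {j for j in sample if j != i and sim(i, j) >= sigma}
        for i in sample
    }
    marked: Set[int] = set()
    clusters: Dict[int, Set[int]] = {}
    for v in sorted(sample, key=lambda v: len(adj[v]), reverse=True):
        if v in marked:
            continue
        clusters[v] = {v} | adj[v]
        marked |= clusters[v]

    # Step 4: compare every remaining document against star centers only.
    for d in rest:
        homes = [c for c in clusters if sim(d, c) >= sigma]
        if homes:
            for c in homes:
                clusters[c].add(d)
        else:
            clusters[d] = {d}  # d becomes the center of a new cluster
    return clusters
```

Only the s-by-s block of the similarity matrix is ever materialized; each remaining document costs one similarity computation per existing center.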

In  this  section  we  present  three  extensions  to  the  star  algorithm  that  optimize  its  performance.  The  three 
algorithms  compute  approximations  to  the  star  cluster  but  optimize  on  the  size  of  the  similarity  matrix 
used  and  on  the  time  required  to  generate  it. 

Both the off-line and on-line versions of the star algorithm rely on the existence of the similarity
matrix. Similarity matrices can get very large: for a document set with n documents the similarity
matrix is an O(n²) space data structure. Moreover, computing the matrix, which takes O(n²) time¹, is
much more expensive than the basic cost of the star clustering algorithm, which is approximately O(n)
time. Thus, it is clear that the similarity matrix is a bottleneck. Computing this matrix is a one-time
pre-processing operation. However, the data structure has to be available on a permanent basis. For these
reasons, we now investigate several methods to improve on the similarity matrix bottleneck.

¹ Note that the actual time is O(n²) times the cost of a vector dot product; because the vectors are sparse, this translates into
O(n²) with a high constant.
4.1 Sampled Stars


The first approximation algorithm uses sampling to compute the similarity matrix and is called the
sampled star algorithm (see Figure 2). The basic idea behind this algorithm is to create a sample of the
document collection that is much smaller than the actual collection. This sample can then be used to
compute a complete Star Clustering, using the off-line star algorithm. For this small set, the computation
of the similarity matrix is much faster. Finally, the rest of the documents can be inserted quickly into
the resulting clusters by comparing each document against the existing star centers only. Documents that
are not close enough to any existing star center (that is, all similarities to existing star centers are
below the threshold) form new clusters. Alternatively, the additional documents can be inserted in the
cluster structure using the on-line star algorithm.


4.2  Linear-space  Stars 


For any threshold σ:

1. Let D be a set of n documents, p a desired probability, and σ a threshold.

2. Let C = ∅ denote the desired clustering.

3. Select a sample S of pairs of documents (d_1, d_2) from D.

4. For each pair (d_1, d_2) in S, if the dot product between (d_1, d_2) > σ increase the degrees of d_1
and d_2.

5. Sort D in descending order by degree.

6. Find and mark all the star centers by examining one-by-one the sorted D.

7. For i = 1 to n insert d_i into all possible star centers.


Figure  3:  The  linear  space  sampled  star  algorithm. 

The sampled star algorithm provides a more efficient way to compute the overall clustering of a
document set, but even this algorithm requires the computation of a complete similarity matrix (which is
smaller than the original matrix). An additional optimization is to eliminate the similarity matrix entirely.
The key information used by the star algorithm is the degree of the nodes in the thresholded similarity
graph. This information can be represented in an array. A trivial algorithm for generating the array is to
compare every document against every other document and count the number of vector products above
the threshold. Note that this method significantly reduces the space requirements but still requires
O(n²) time, where n is the number of documents. An alternative is to compute the vertex
degrees approximately, using sampling. For each document, we first generate a sample of documents
to be used for comparison. A dot product is computed between the document and each member of the
sample set. The degree of the document vertex is given by the number of dot products that are above the
threshold. Figure 3 summarizes this algorithm.
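The degree-sampling idea can be sketched as below. The sketch follows the per-document sampling described in the paragraph above rather than the pair sampling of Figure 3, and it assumes a document becomes a star center when it is not similar to any center chosen before it. That rule, the `sim(i, j)` callback, and the name `linear_space_star` are assumptions for illustration.

```python
import random
from typing import Callable, Dict, List, Set


def linear_space_star(
    n: int,
    sim: Callable[[int, int], float],
    sigma: float,
    sample_size: int,
    seed: int = 0,
) -> Dict[int, Set[int]]:
    """Star clustering with sampled degrees and no similarity matrix."""
    rng = random.Random(seed)
    k = min(sample_size, n)

    # Approximate each vertex degree from a random sample of comparisons.
    degree = [0] * n
    for i in range(n):
        for j in rng.sample(range(n), k):
            if j != i and sim(i, j) >= sigma:
                degree[i] += 1

    # Scan documents by decreasing estimated degree and pick star centers.
    order = sorted(range(n), key=lambda v: degree[v], reverse=True)
    centers: List[int] = []
    for v in order:
        if not any(sim(v, c) >= sigma for c in centers):
            centers.append(v)

    # Insert every document into all centers it is similar to.
    clusters: Dict[int, Set[int]] = {c: {c} for c in centers}
    for d in range(n):
        for c in centers:
            if d != c and sim(d, c) >= sigma:
                clusters[c].add(d)
    return clusters
```

The only per-document state is the degree array, so the extra memory is linear in n; the price is that similarities are recomputed on demand instead of being looked up.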


4.3  Distributed  Stars 

For any threshold σ:

1. Let D be a set of n documents. Divide D into k disjoint sets D_1 ... D_k.

2. Run the Star algorithm on k separate machines to produce the star clusterings C_1 ... C_k.

3. Let c_1 ... c_j be the set of star centers in all the star covers.

4. Run the Star algorithm on the set of documents c_1 ... c_j.

5. If two star centers are placed in the same cluster in the previous step, merge their clusters
using a union operation.

Figure 4: The distributed star algorithm.


Another bottleneck for the star algorithm comes up in Internet applications, such as organizing data
collected from various sites and databases by topic. Consider a task in which several databases are
queried with the same question. The documents returned by these queries are to be fused and presented
to the user in a coherent picture. One approach is to run the queries, download all documents, and
organize the entire collection at the user site using the star algorithm. An alternative approach is to run
the queries, organize the search results at the location of the database, and then merge these results on
the user machine. This second alternative has several advantages: (1) the star algorithm can be run in
parallel, which provides a speedup; (2) the document transfer operation can also be parallelized²; and
(3) the local topic organizations can be viewed as a way of compressing the documents, can be used to
generate the merged topics in the distributed collection, and can be transferred much faster than the actual
documents to the user's machine.

For these reasons, we describe a third approximation of the star algorithm called the distributed star
algorithm, which is especially useful when the document collection is very large. The distributed star
algorithm provides parallelism and is based on a "divide and conquer" approach. The document collection
is partitioned into several disjoint sets. The sets are clustered separately and the resulting clusters are
then merged. Figure 4 shows the details of this algorithm. Note that for this version of the algorithm, the
off-line Star algorithm can be replaced with the Sampled Star algorithm or with the Linear Space Star
algorithm.
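The divide-and-conquer structure can be sketched as follows. The local clusterings run sequentially here, whereas a real deployment would run them on separate machines; the round-robin partition, the `sim(i, j)` callback, and the function names are assumptions for illustration.

```python
from typing import Callable, Dict, List, Set

Sim = Callable[[int, int], float]


def star_cover(ids: List[int], sim: Sim, sigma: float) -> Dict[int, Set[int]]:
    """Off-line greedy star cover restricted to the given document ids."""
    adj = {i: {j for j in ids if j != i and sim(i, j) >= sigma} for i in ids}
    marked: Set[int] = set()
    clusters: Dict[int, Set[int]] = {}
    for v in sorted(ids, key=lambda v: len(adj[v]), reverse=True):
        if v in marked:
            continue
        clusters[v] = {v} | adj[v]
        marked |= clusters[v]
    return clusters


def distributed_star(
    ids: List[int], sim: Sim, sigma: float, k: int
) -> Dict[int, Set[int]]:
    """Cluster k disjoint parts separately, then merge via their star centers."""
    parts = [ids[i::k] for i in range(k)]  # step 1: k disjoint sets
    local: Dict[int, Set[int]] = {}
    for part in parts:  # step 2: one star cover per part ("machine")
        local.update(star_cover(part, sim, sigma))
    centers = list(local)  # step 3: all local star centers
    top = star_cover(centers, sim, sigma)  # step 4: cluster the centers
    merged: Dict[int, Set[int]] = {}
    for c, group in top.items():  # step 5: union clusters of co-clustered centers
        merged[c] = set().union(*(local[g] for g in group))
    return merged
```

Only the star centers and cluster memberships cross machine boundaries in this scheme, which is the compression effect the text describes.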


4.4  Experiments  and  Evaluations 

We devised two experiments for the purpose of testing our algorithms on real-world data. Because we
were limited by computer memory, we focused the experiments on the Linear Space Sampled Star
algorithm (see Figure 3), which was introduced to optimize both time performance and space requirements.

In our first experiment, we ran the Linear Space Sampled Star algorithm on a 50000-document subset
of the TREC volume 1 corpus at various sample sizes. We compared the output of the Linear Space
Sampled Star algorithm with sampling to its output without sampling, and show these results in Figure 5.
Note that when sampling is not used, the Linear Space Sampled Star algorithm produces the same
output as the Star algorithm.

E' of Linear Space Sampled Star With 50000 Documents


Figure 5: The effects of the sample size on the quality of clusters obtained using the Linear Space
Sampled Star algorithm. The x-axis shows the sample size. The y-axis shows the aggregate E-measure
computed relative to the star algorithm. The smaller the E-value is, the better the performance is. The
experiment was done with a TREC subset of 50000 documents.


To measure the difference between the outputs of the two algorithms, we calculated an aggregate
precision and recall for each sample size as follows. For each cluster x in the output of the sampled algorithm,
we calculated the precision and recall of the documents in x against each cluster in the output of the
unsampled algorithm. We then determined the cluster y in the output of the unsampled algorithm that
minimizes van Rijsbergen's (Rijsbergen, 1979) evaluation measure

$$E(p, r) = 1 - \frac{2}{1/p + 1/r}$$

where p and r are the standard precision and recall of the cluster with respect to the set of documents
relevant to the topic. Finally, we calculated a weighted average E' of the E-values calculated previously,
weighting each E value by the number of documents in the associated cluster. Figure 5 shows the results
of this analysis. With larger samples, the sampled algorithm generally produced exactly the same results
as the algorithm that did not use sampling. As the portion of the similarity matrix sampled decreased,
the results of the sampled algorithm deviated increasingly from those of the unsampled algorithm.
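The aggregate measure described above can be written out as a short Python sketch. It assumes clusters are non-empty sets of document identifiers and treats a cluster pair with no overlap as E = 1; the function names are hypothetical.

```python
from typing import List, Set


def e_measure(p: float, r: float) -> float:
    """van Rijsbergen's E = 1 - 2 / (1/p + 1/r); lower is better."""
    if p == 0.0 or r == 0.0:
        return 1.0  # no overlap: worst possible score
    return 1.0 - 2.0 / (1.0 / p + 1.0 / r)


def aggregate_e(sampled: List[Set[int]], reference: List[Set[int]]) -> float:
    """Weighted average E' of the best-match E value of each sampled cluster."""
    weighted_sum = 0.0
    total_docs = 0
    for x in sampled:
        best = 1.0
        for y in reference:
            overlap = len(x & y)
            precision = overlap / len(x)
            recall = overlap / len(y)
            best = min(best, e_measure(precision, recall))
        weighted_sum += best * len(x)  # weight by cluster size
        total_docs += len(x)
    return weighted_sum / total_docs
```

Identical clusterings give E' = 0, and splitting one reference cluster of four documents into two halves gives E' = 1/3, since each half has precision 1 and recall 1/2.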


Our  subsequent  analyses  sought  to  determine  whether  the  divergent  output  of  the  sampled  algorithm 
was  inferior  to  the  output  of  the  unsampled  algorithm.  The  original  purpose  of  the  Star  algorithm  was 
to  calculate  a  cover  of  the  input  documents  using  as  few  star-shaped  clusters  as  possible  (Aslam  et  al, 
1998;  Aslam  et  al,  1999).  The  Linear  Space  Sampled  Star  algorithm  also  generates  a  cover  of  the  input 
documents  with  star-shaped  clusters,  so  we  compared  the  number  of  clusters  in  the  algorithm’s  output  at 
varying  sample  sizes  to  the  number  of  clusters  in  the  output  of  the  unsampled  algorithm  (see  Figure  6). 
Surprisingly, even with samples as small as 5%, the number of clusters output by the sampled algorithm
was never more than five percent larger than the number of clusters that the unsampled algorithm
generated. In fact, the sampled algorithm generally covered the corpus with fewer star-shaped clusters than
the unsampled algorithm did.

Number of Clusters for Linear Space Sampled Star with 50000 Documents


Figure 6: The effect of sampling on the number of clusters generated. The x-axis shows the sample
size. The y-axis shows the ratio of the number of clusters generated by the Linear Space Sampled
Star algorithm to the number of clusters generated by the Star algorithm. We observe that sampling has
little effect on the number of clusters discovered in the collection. The experiment was done with a
TREC subset of 50000 documents.


Our second experiment compared the output of the Linear Space Sampled Star algorithm against
categorization decisions made by humans. Specifically, the algorithm was run on 4925 documents from the
FBIS corpus that had been labeled by humans with one or more of 47 different categories. We repeated
the precision/recall analysis of the first experiment, using the 47 categories in place of the output of
the unsampled Star algorithm. As with the previous experiment, samples as small as 1% produced results
comparable to a 100% sample (see Figure 7).

Overall, our experiments indicated that the Linear Space Sampled Star algorithm generates output
comparable in quality to that of the Star algorithm, but uses considerably fewer CPU and memory resources.
Both of our implementations of the Linear Space Star algorithm required only 81 megabytes of memory
to process 50000 documents, 73 megabytes of which were used only to store the vector representations
of the documents. On the other hand, an implementation of the Star algorithm that uses a sparse
thresholded similarity matrix would require approximately 2.5 gigabytes of memory for 50000 documents, and
a complete similarity matrix stored in a double-precision floating-point array would require 18.6
gigabytes of memory. The gains in performance due to sampling were similarly significant. Figure 8 shows
the amount of time that the Linear Space Sampled Star algorithm requires to process 50000 documents at
varying sample sizes. These times were measured on a 250 MHz MIPS R10000 and do not include the
time required to parse the documents. We found the running time of the algorithm to be almost directly
proportional to the size of the sample. At sample sizes of less than 5%, the Linear Space Sampled Star
algorithm organized documents at an average rate comparable to the bandwidth of most Internet
connections (see Figure 9). Tests comparing the Star algorithm with the Linear Space Sampled Star algorithm
on smaller data sets indicated that the overhead of sampling and reducing memory requirements results in
an increase in running time of less than 5%.
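The 18.6-gigabyte figure for a complete matrix can be checked with a few lines of arithmetic. The sketch assumes 8 bytes per double-precision entry and reads "gigabyte" as 2^30 bytes, which is the interpretation that reproduces the stated number; the function name is hypothetical.

```python
def dense_matrix_gib(n: int, bytes_per_entry: int = 8) -> float:
    """Memory for a complete n-by-n matrix, in units of 2**30 bytes."""
    return n * n * bytes_per_entry / 2**30


# 50000 * 50000 entries * 8 bytes = 2e10 bytes
print(f"{dense_matrix_gib(50_000):.1f} GiB")  # prints "18.6 GiB"
```

The 2.5-gigabyte figure for the sparse thresholded matrix depends on how many entries survive the threshold, which the text does not report, so it is not reproduced here.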

E' of Linear Space Sampled Star vs. FBIS Categorizations of 4925 Documents

Figure 7: The effect of sampling on the quality of the clustering for the FBIS collection. The x-axis
shows the sample size. The y-axis shows the E-measure computed relative to the human clustering.

Finally, we have conducted a small experiment on 1000 TREC documents to study the performance of
the Distributed Star algorithm. Figure 10 shows the accuracy of the distributed star algorithm relative to
the off-line star algorithm. We note that when the number of computers is the same as the number of
documents, Step 4 of the Distributed Star Algorithm (Figure 4) performs a star clustering of the entire
collection. The same is true when there is a single machine. The greatest degree of parallelism and
distribution is achieved when the number of machines is √m, where m is the number of documents in the
collection. For this experiment, m = 1000 and √m is approximately 32. The experiment shows that the
E-measure for 32 machines is about 41%.


5  Conclusion 


We  presented  a  scalable  algorithm  for  information  organization.  Scalability  is  a  very  important  property 
for  information  organization  algorithms  especially  when  the  collections  are  dynamic  and  Web-based. 
We  implemented  these  algorithms  as  a  scalable  system  for  information  organization.  In  the  near  future, 
we  plan  to  expand  our  experimental  collection  to  demonstrate  the  performance  of  our  algorithms  when 
dealing  with  hundreds  of  thousands  of  documents. 
Figure 9: The effect of the sample size on the rate of the Linear Space Sampled Star algorithm (plotted
on a logarithmic scale).


Similarity of Distributed Star Algorithm to Star Algorithm


Figure 10: This graph shows the E-measure of the distributed star algorithm relative to the off-line star
clustering of the same document set.
Cluster hierarchy

time efficient

Using one cluster

use training data to find clustering threshold, relevant cluster

Results (F avg)

FBIS: center .27 cluster .29 star .29

AP: center .27 cluster .30 star .31

News: center .89 cluster .91 star .91

easy topic exploration and query refinement

cluster center document summarizes clusters

relevance feedback on cluster centers only

with clustering organization can retrieve all relevant documents

Interactive applications: organized retrieval,

filtering (persistent queries)



Future  Directions 


Visualization
Topic summarization

Daniela  Rus 

Dartmouth  Computer  Science 


Misunderstood  query 




Information  Organization 



manageable  amount  of  information 

helps  user  to  navigate  and  narrow  the  request 


Hybrid  applications 

i.e.,  dynamic  collection  changing  user  interests 
modern  information  system  trend 



manual, classification, clustering


Related  Work 



Cluster  hypothesis 

•  similar  documents  are  relevant  to  the  same  requests  (van  Rijsbergen  1971) 